Method and System for DNA Analysis

ABSTRACT

The present invention pertains to a process for automatically analyzing nucleic acid samples. Specifically, the process comprises the steps of forming electrophoretic data of DNA samples with DNA ladders; comparing these data; transforming the coordinates of the DNA sample&#39;s data into DNA length coordinates; and analyzing the DNA sample in length coordinates. This analysis is useful for automating fragment analysis and quality assessment. The automation enables a business model based on usage, since it replaces (ratter than assists) labor. This analysis also provides a mechanism whereby data generated on different instruments can be confidently compared. Genetic applications of this invention include gene discovery, genetic diagnosis, and drug discovery. Forensic applications include identifying people and their relatives, catching perpetrators, analyzing DNA mixtures, and exonerating innocent suspects.

FIELD OF TRY INVENTION

The present invention pertains to a process for analyzing a DNAmolecule. More specifically, the present invention is related toperforming experiments that produce quantitative data, and thenanalyzing these data to characterize a DNA fragment. The invention alsopertains to systems related to this DNA fragment information.

BACKGROUND OF THE INVENTION

With the advent of high-throughput DNA fragment analysis byelectrophoretic separation, many useful genetic assays have beendeveloped. These assays have application to genotyping, linkageanalysis, genetic association, cancer progression, gene expression,pharmaceutical development, agricultural improvement, human identity,and forensic science.

However, these assays inherently produce data that have significanterror with respect to the size and concentration of the characterizedDNA fragments. Much calibration is currently done to help overcome theseerrors, including the use of in-lane molecular weight size standards. Inspite of these improvements, the variability of these properties(between different instruments, runs, or lanes) can exceed the desiredtolerance of the assays.

Recently, advances have been made in the automated scoring of geneticdata. Many naturally occurring artifacts in the amplification andseparation of nucleic acids can be eliminated through calibration andmathematical processing of the data on a computing device (M W Perlin, MB Burks, R C Hoop, and E P Hoffman, “Toward fully automated genotyping:allele assignment, pedigree construction, phase determination, andrecombination detection in Duchenne muscular dystrophy,” Am. J. Hum.Genet., vol. 55, no. 4, pp. 777-787, 1994; M W Perlin, G Lancia, and S-KNg, “Toward fully automated genotyping: genotyping microsatellitemarkers by deconvolution,” Am. J. Hum. Genet., vol. 57, no. 5, pp.1199-1210, 1995; S-K Ng, “Automating computational molecular genetics:solving the microsatellite genotyping problem,” Carnegie MellonUniversity, Doctoral dissertation CMU-CS-98-105, Jan. 23, 1998),incorporated by reference.

This invention pertains to the novel use of calibrating data andmathematical analyses to Computationally eliminate undesirable dataartifacts in a nonobvious way. Specifically, the use of allelic laddersand coordinate transformations can help an automated data analysissystem better reduce measurement variability to within a desired assaytolerance. This improved reproducibility is useful in that it results ingreater accuracy, and more complete automation of the genetic assays,often taking less time at a lower cost with fewer people.

Genotyping Technology

Genotyping is the process of determining the alleles at an individual'sgenetic locus. Such loci can be any inherited DNA sequence in thegenome, including protein-encoding genes and polymorphic markers. Thesemarkers include short tandem repeat (STR) sequences; single-nucleotidepolymorphism (SNP) sequences, restriction fragment length polymorphism(RFLP) sequences, and other DNA sequences that express genetic variation(G Gyapay, J Morissette, A Vignal, C Dib, C Fizames, P Millasseau, SMarc, G Bernardi, M Lathrop, and J Weissenbach, “The 1993-94 GenethonHuman Genetic Linkage Map,” Nature Genetics, vol. 7; no. 2, pp. 246-339,1994; F W Reed, J L Davies, J B Copeman, S T Bennett, S M Palmer, L EPritchard, S C L Gough, Y Kawaguchi; H J Cordell, K M Balfour, S CJenkins, E E Powell, A Vignal, and J A Todd, “Chromosome-specificmicrosatellite sets for fluorescence-based, semi-automated genomemapping,” Nature Genet., vol. 7, no. 3, pp. 390-395, 1994; L Kruglyak,“The use of a genetic map of biallelic markers in linkage studies,”Nature Genet., vol. 17, no. 1, pp. 21-24, 1997; D Wang, J Fan, C Siao, ABerno, P Young, R Sapolsky, G Ghandour, N Perkins, E Winchester, JSpencer, L Kruglyak, L Stein, L Hsie, T Topaloglou, E Hubbell, ERobinson, M Mittmann, M Morris, N Shen, D Kilburn, J Rioux, C Nusbaum, SRozen, T Hudson, and E Lander, “Large-scale identification, mapping, andgenotyping of single-nucleotide polymorphisms in the human genome,”Science, vol. 280, no. 5366, pp. 1077-82, 1998; P Vos, R Hogers, MBleeker, M Reijans, T van de Lee, M Hornes, A Frijters, J Pot, JPeleman, M Kuiper, and M Zabeau, “AFLP: a new technique for DNAfingerprinting,” Nucleic Acids Res, vol. 23, no. 21, pp. 4407-14, 1995;J Sambrook, E F Fritsch, and T Maniatis, Molecular Cloning, SecondEdition. Plainview, N.Y.: Cold Spring Harbor Press, 1989), incorporatedby reference.

The polymorphism assay is typically done by characterizing the lengthand quantity of DNA from an individual at a marker. For example, STRsare assayed by polymerase chain reaction (PCR) amplification of anindividual's STR locus using a labeled PCR primer, followed by sizeseparation of the amplified PCR fragments. Detection of the fragmentlabels, together with in-lane size standards, generates a signal thatpermits characterization of the size and quantity of the DNA fragments.From this characterization, the alleles of the STR locus in theindividual's genome can be determined (J Weber and P May, “Abundantclass of human DNA polymorphisms which can be typed using the polymerasechain reaction,” Am. J. Hum. Genet., vol. 44, pp. 388-396, 1989; J SZiegle, Y Su, K P Corcoran, L Nie, P E Mayrand, L B Hoff, L J McBride, MN Kronick, and S R Diehl, “Application of automated DNA sizingtechnology for genotyping microsatellite loci,” Genomics, vol. 14,pp.1026-1031, 1992), incorporated by reference.

The labels can use radioactivity, fluorescence, infrared, or othernonradioactive labeling methods (F M Ausubel, R Brent, R E Kingston, D DMoore, J G Seidman, J A Smith, and K Struhl, ed., Current Protocols inMolecular Biology. New York, N.Y.: John Wiley and Sons, 1995; N JDracapoli, J L Haines, B R Korf, C C Morton, C E Seidman, J G Seidman, DT Moir, and D Smith, ed., Current Protocols in Human Genetics. New York:John Wiley and Sons, 1995; L J Kricka, ed., Nonisotopic Probing,Blotting, and Sequencing, Second Edition. San Diego, Calif.: AcademicPress; 1995), incorporated by reference.

Size separation of fragment molecules is typically done using gel orcapillary electrophoresis (CE); newer methods include mass spectrometryand microchannel arrays (R A Mathies and X C Huang; “Capillary arrayelectrophoresis: an approach to high-speed, high-throughput DNAsequencing,” Nature, vol. 359, pp. 167-169, 1992; K J Wu, A Stedding,and C H Becker, “Matrix-assisted laser desorption time-of-flight massspectrometry of oligonucleotides using 3-hydroxypicolinic acid as anultraviolet-sensitive matrix,” Rapid Common. Mass Spectrom., vol. 7, pp.142-146, 1993), incorporated by reference.

The label detection method is contingent on both the labels used and thesize separation-mechanism. For example, with automated DNA sequencerssuch as the PE Biosystems ABI/377 gel, ABI/310 single capillary orABI/3700 capillary array-instruments, the detection is done by laserscanning of the fluorescently labeled fragments, imaging on a CCDcamera, and electronic acquisition of the signals from the CCD camera.Flatbed laser scanners, such as the Molecular Dynamics Fluorimager orthe Hitachi FMBIO/II acquire fluorescent signals similarly. Li-Cor'sinfrared automated sequencer uses a detection technology modified forthe infrared range. Radioactivity can be detected using film or phosphorscreens. In mass spectrometry, the atomic mass can be used as asensitive label. See (A. J. Kostichka, Bio/Technology, vol. 10, pp. 78;1992), incorporated by reference.

Size characterization is done by comparing the sample fragment's signalin the context of the size standards. By separate calibration of thesize standards used, the relative molecular size can be inferred. Thissize is usually only an approximation to the true size in base pairunits, since the size standards and the sample fragments generally havedifferent chemistries and electrophoretic migration patterns (S-K Ng,“Automating computational molecular genetics: solving the microsatellitegenotyping problem,” Carnegie Mellon University, Doctoral dissertationCMU-CS-98-105, Jan. 23, 1998), incorporated by reference.

Quantitation of the DNA signal is usually done by examining peak heightsor peak areas. One inexact peak area method simply records the areaunder the curve; this approach does not account for band overlap betweendifferent peaks. It is often useful to determine the quality (e.g.,error, accuracy, concordance with expectations) of the size or quantitycharacterizations. See (D R Richards and M W Perlin, “Quantitativeanalysis of gel electrophoresis data for automated genotypingapplications,” Amer. J. Hum. Genet., vol. 57, no. 4 Supplement, pp. A26,1995), incorporated by reference.

The actual genotyping result depends on the type of genotype, thetechnology used, and the scoring method. For example, with STR data,following size separation and characterization, the sizes (exact,rounded, or binned) of the two tallest peaks might be used as thealleles. Alternatively, PCR artifacts (e.g., stutter, relativeamplification) can be accounted for in the analysis, and the allelesdetermined after mathematical corrections have been applied. See (M WPerlin, “Method and system for genotyping,” U.S. Pat. No. 5,541,067,Jul. 30, 1996; M W Perlin, “Method and system for genotyping,” U.S. Pat.No. 5,580,728, Dec. 3, 1996), incorporated by reference.

Genotyping Applications

Genotyping data can be used to determine how mapped markers are sharedbetween related individuals. By correlating this sharing informationwith phenotypic traits, it is possible to localize a gene associatedwith that inherited trait. This approach, is widely used in geneticlinkage and association studies (J Ott, Analysis of Human GeneticLinkage, Revised Edition. Baltimore, Md.: The Johns Hopkins UniversityPress, 1991; N Risch, “Genetic, Linkage and Complex Diseases, WithSpecial Reference to Psychiatric Disorders,” Genet. Epidemiol., vol. 7,pp. 3-16, 1990; N Risch and R Merikangas, “The future of genetic studiesof complex human diseases,” Science, vol. 273, pp. 1516-1517, 1996),incorporated by reference.

Genotyping data can also be used to identify individuals. For example,in forensic science, DNA evidence can connect a suspect to the scene ofa crime. DNA databases can provide a repository of such relationalinformation (C P Kimpton, P Gill, A Walton, A Urquhart, E S Millican,and M Adams, “Automated DNA profiling employing multiplex amplificationof short tandem repeat loci,” PCR Meth. Appl., vol. 3, pp. 13-22, 1993;J E McEwen, “Forensic DNA data banking by state crime laboratories,” Am.J. Hum. Genet., vol. 56, pp. 1487-1492, 1995; K Inman and N Rudin, AnIntroduction to Forensic DNA Analysis. Boca Raton, Fla.: CRC Press,1997; C J Fregeau and R M Fourney, “DNA typing with fluorescently taggedshort tandem repeats: a sensitive and accurate approach to humanidentification,” Biotechniques, vol. 15, no. 1, pp. 100-119, 1993),incorporated by reference.

Linked genetic markers can help predict the risk of disease. Inmonitoring cancer, STRs are used to assess microsatellite instability(MI) and loss of heterozygosity (LOH)—chromosomal alterations thatreflect tumor progression. (I D Young, Introduction to Risk Calculationin Genetic Counselling. Oxford: Oxford University Press, 1991; LCawkwell, L Ding, F A Lewis, I Martin, M F Dixon, and P Quirke,“Microsatellite instability in colorectal cancer: improved assessmentusing fluorescent polymerase chain reaction,” Gastroenterology, vol.109, pp. 465-471, 1995; F Canzian, A Salovaara, P Kristo, R B Chadwick,L A Aaltonen, and A de la Chapelle, “Semiautomated assessment of loss ofheterozygosity and replication error in tumors,” Cancer Research, vol.56, pp. 3331-3337, 1996; S Thibodeau, G Bren, and D Schaid,“Microsatellite instability in cancer of the proximal colon,” Science,vol. 260, no. 5109, pp. 816-819, 1993), incorporated by reference.

For crop and primal improvement, genetic mapping is a very powerfultool. Genotyping can help identify useful traits of nutritional oreconomic importance. (H J Vilkki, D J de Koning, K Elo, R Velmala, and AMaki-Tanila, “Multiple marker mapping of quantitative trait loci ofFinnish dairy cattle by regression,” J. Dairy Sci., vol. 80, no. 1, pp.198-204, 1997; S M Kappes, J W Keele, R T Stone, R A McGraw, T SSonstegard, T P Smith, N L Lopez-Corrales, and C W Beattie, “Asecond-generation linkage map of the bovine genome,” Genome Res., vol.7, no. 3, pp. 235-249, 1997; M Georges, D Nielson, M Mackinnon, AMishra, R Okimoto, A T Pasquino, L S Sargeant, A Sorensen, M R Steele,and X Zhao, “Mapping quantitative trait loci controlling milk productionin dairy cattle by exploiting progeny testing,” Genetics, vol. 139, no.2, pp. 907-920, 1995; G A Rohrer, L J Alexander, Z Hu, T P Smith, J WKeele, and C W Beattie, “A comprehensive map of the porcine genome,”Genome Res., vol. 6, no. 5, pp. 371-391, 1996; J Hillel, “Map-basedquantitative trait locus identification,” Poult. Sci., vol. 76, no. 8,pp. 1115-1120, 1997; H H Cheng, “Mapping the chicken genome,” Poult.Sci., vol. 76, no. 8, pp. 1101-1107, 1997), incorporated by reference.

Other Sizing Assays

Fragment analysis finds application in other genetic methods. Oftenfragment sizes are used to multiplex many experiments into one sharedreadout pathway, where size for size range) serves an index intopost-readout demultiplexing. For example, multiple genotypes aretypically pooled into a single lane for more efficient readout.Quantifying information can help determine the relative amounts ofnucleic acid products present in tissues. (G R Taylor, J S Noble, and RF Mueller, “Automated analysis of multiplex microsatellites,” J. Med.Genet., vol. 31, pp. 937-943, 1994; L S Schwartz, J Tarleton, BPopovich, W K Seltzer, and E P Hoffman, “Fluorescent multiplex linkageanalysis and carrier detection for Duchenne/Becker muscular dystrophy,”Am. J. Hum. Genet., vol. 51, pp. 721-729, 1992; C P Kimpton, P Gill, AWalton, A Urquhart, E S Millican, and M Adams, “Automated DNA profilingemploying multiplex amplification of short tandem repeat loci,” PCRMeth. Appl., vol. 3, pp. 13-22, 1993), incorporated by reference.

Differential display is a gene expression assay. It performs a reversetranscriptase PCR (RT-PCR) to capture the state of expressed mRNAmolecules into a more robust DNA form. These DNAs are then sizeseparated, and the size bins provide an index into particular molecules.Variation at a size bin between two tissue assays is interpreted as aconcommitant variation in the underlying mRNA gene expression profile. Apeak quantification at a bin estimates the underlying mRNAconcentration. Comparison of the quantitation of two different samplesat the same bin provides a measure of relative up- or down-regulation ofgene expression. (S W Jones, D Cai, O S Weislow, and B Esmaeli-Azad,“Generation of multiple mRNA fingerprints using fluorescence-baseddifferential display and an automated DNA sequencer,” BioTechniques,vol. 22, no. 3, pp. 536-543, 1997; P Liang and A Pardee, “Differentialdisplay of eukaryotic messenger RNA by means of the polymerase chainreactions,” Science, vol. 257, pp. 967-971, 1992; K R Luehrsen, L LMarr, E van der Knaap, and S Cumberledge, “Analysis of differentialdisplay RT-PCR products using fluorescent primers and Genescansoftware,” BioTechniques, vol. 22, no. 1, pp. 168-174, 1997),incorporated by reference.

Single stranded conformer polymorphism (SSCP) is a method for detectingdifferent mutations in a gene. Single base pair changes can markedlyaffect fragment mobility of the conformer, and these mobility changescan be detected in a size separation assay. SSCP is of particular use inidentifying and diagnosing genetic mutations (M Orita, H Iwahana, HKanazawa, K Hayashi, and T Sekiya, “Detection of polymorphisms of humanDNA by gel electrophoresis as single-strand conformation polymorphisms,”Proc Natl Acad Sci USA, vol. 86, pp. 2766-2770, 1989), incorporated byreference.

The AFLP technique provides a very powerful DNA fingerprinting techniquefor DNAs of any origin or complexity. AFLP is based on the selective PCRamplification of restriction fragments from a total digest of genomicDNA. The technique involves three steps: (i) restriction of the DNA andligation of oligonucleotide adapters, (ii) selective amplification ofsets of restriction fragments, and (iii) gel analysis of the amplifiedfragments. PCR amplification of restriction fragments is achieved byusing the adapter and restriction site sequence as target sites forprimer annealing. The selective amplification is achieved by the use ofprimers that extend into the restriction fragments, amplifying onlythose fragments in which the primer extensions match the nucleotidesflanking the restriction sites. Using this method, sets of restrictionfragments may be visualized by PCR without knowledge of nucleotidesequence. The method allows the specific co-amplification of highnumbers of restriction fragments. The number of fragments that can beanalyzed simultaneously, however, is dependent on the resolution of thedetection system. Typically 50-100 restriction fragments are amplifiedand detected on denaturing polyacrylamide gels. (P Vos, R Hogers, MBleeker, M Reijans, T van de Lee, M Barnes, A Frijters, J Pot, JPeleman, M Kuiper, and M Zabeau, “AFLP: a new technique for DNAfingerprinting,” Nucleic Acids Res, vol. 23, no. 21, pp. 4407-14, 1995),incorporated by reference.

Data Scoring

The final step in any fragment assay is scoring the data. This istypically done by having people visually review every experiment. Somesystems (e.g., PE informatics' Genotyper program) perform an initialcomputer review of the data, to make the manual visual review of everygenotype easier. More advanced systems (e.g., Cybergenetics' TrueAlleletechnology) fully automate the data review, and provide data qualityscores that can be used to identify data artifacts (for eliminating suchdata from consideration) and rank the data scores (to focus on just the2%-25% of suspect data calls). See (B Pálsson, F Pálsson, M Perlin, HGubjartsson, K Stefánsson, and J Gulcher, “Using quality measures tofacilitate allele calling in high-throughout genotyping,” GenomeResearch, vol. 9, no. 10, pp. 1002-1012, 1999; M W Perlin, “Method andsystem for genotyping,” U.S. Pat. No. 5,876,933, Mar. 2, 1999),incorporated by reference.

However, even with such advanced scoring technology, artifacts canobscure the results. More importantly, insufficient data calibration canpreclude the achievement of very low (e.g., <1%) data error rates,regardless of the scoring methods, For example, in high-throughput STRgenotyping, differential migration of a sample's PCR fragments relativeto the size standards can produce subtle shifts in detected size. Thisproblem is worse when different instruments ate used, or when sizeseparation protocols are not entirely uniform. The result is thatfragments can be incorrectly assigned to allele bins in a way thatcannot be corrected without recourse to additional information (e.g.,pedigree data) completely outside the STR sizing assay.

Whole System

This invention centers on a new way to greatly reduce sizing andquantitation errors in fragment analysis. By designing data generationexperiments that include the proper calibration data (e.g., internallane standards, allelic ladders, uniform run conditions), most of thesefragment analysis errors cat be eliminated entirely. Moreover, computersoftware can be devised that fully exploits these data calibrations toautomatically identify artifacts and rank the data by quality. Theresult is a largely error-free system that requires minimal (if any)human intervention.

SUMMARY OF THE INVENTION

The present invention pertains to a method for analyzing a nucleic acidsample. The method comprises the steps of forming labeled DNA samplefragments from a nucleic acid sample. Then there is the step of sizeseparating and detecting said sample fragments to form a sample signal.Then there is the step of forming labeled DNA ladder fragmentscorresponding to molecular lengths. Then there is the step of sizeseparating and detecting said ladder fragments to form a ladder signal.Then there is the step of transforming the sample signal into lengthcoordinates using the ladder signal. Then there is the step of analyzingthe nucleic acid sample signal in length coordinates.

The present invention also pertains to a system for analyzing a nucleicacid sample. The system comprises means for forming labeled DNA samplefragments from a nucleic acid sample. The system further comprises meansfor size separating and detecting said sample fragments to form a samplesignal, said separating and detecting means in communication with thesample fragments. The system further comprises means for forming labeledDNA ladder fragments corresponding to molecular lengths. The systemfurther comprises means for size separating and detecting said ladderfragments to form a ladder signal, said separating and detecting meansin communication with the ladder fragments. The system further comprisesmeans for transforming the sample signal into length coordinates usingthe ladder signal, said transforming means in communication with thesignals. The system further comprises means for analyzing the nucleicacid sample signal in length coordinates, said analyzing means incommunication with the transforming means.

The present invention also pertains to a method for generating revenuefrom computer scoring of genetic data. The method comprises the stews ofsupplying a software program that automatically scores genetic data.Then there is the step of forming genetic data that can be scored by thesoftware program. Then there is the step of scoring the genetic datausing the software program to form a quantity of genetic data. Thenthere is the step of generating a revenue from computer scoring ofgenetic data that is related to the quantity.

The present invention also pertains to a method for producing a nucleicacid analysis. The method comprises the steps of analyzing a firstnucleic acid sample on a first size separation instrument to form afirst signal. Then there is the step of analyzing a second nucleic acidsample on a second size separation instrument to form a second signal.Then there is the step of comparing the first signal with the secondsignal in a computing device with memory to form a comparison. Thenthere is the step of producing a nucleic acid analysis of the twosamples from the comparison that is independent of the size separationinstruments used.

The present invention also pertains to a method for resolving DNAmixtures. The method comprises the steps of obtaining DNA profile datathat include a mixed sample. Then there is the step of representing thedata in a linear actuation. Then there is the step of deriving asolution from the linear equation. Then there is the step of resolvingthe DNA mixture from the solution.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows the steps of creating sized profiles.

FIG. 2 shows unimodal plots, each corresponding to a fluorescent dye;within each plot, intensity is plotted against the sampled spectrum.

FIG. 3 shows the unimodality constraint determining the function spacegeometry of the spectrum sampling vectors.

FIG. 4 shows the results of signal processing, size-tracking, and sizetransformation.

FIG. 5 shows the steps of quantitating and analyzing genetic data.

FIG. 6 shows the results of ladder processing, peak quantitation andallele calling.

FIG. 7 shows a graphical user interface for navigating prioritizedgenotyping results.

FIG. 8 shows a textual interface for displaying useful genotype results.

FIG. 9 shows a visualization that is customized to a data artifact.

FIG. 10 shows a system for analyzing a nucleic acid sample.

FIG. 11 shows the result of a differential display gene expressionanalysis.

FIG. 12 shows the flow graph of automated software assembly.

FIG. 13 shows a spreadsheet for calculating the labor cost of scoringgenetic data.

FIG. 14 shows a heuristic function dev(g(w)) which has an unambiguouslocal minimum.

DESCRIPTION OF THE PREFERRED EMBODIMENT Data Generation

In the most preferred embodiment, genotyping data is generated using STRmarkers. These tandem repeats include mono-, di-, tri-, tetra-, penta-,hexa-, hepta-, octa-, nona-, deca- (and so on) nucleotide repeatelements. STRs are highly abundant and informative marker distributedthroughout the genomes of many species (including human). Typically,STRs are labeled, PCR amplified, and then detected (for size andquantity) on an electrophoretic gel.

The laboratory processing starts with the acquisition of a sample, andthe extraction of its DNA. The extraction and purification are typicallyfollowed by PCR amplification. Labelling is generally done using a 5′labeled PCR primer, or with incorporation labeling in the PCR. Prior toloading, multiple marker PCR products in k−1 different fluorescentcolors are pooled, and size standards (preferably in a k^(th) differentcolor) is added. Size separation and detection is preferably done usingautomated fluorescent DNA sequencers, with either slab gel or capillarytechnology. The detected signals represent the progression of DNA bandsas a function of time. These signals are transmitted from the sequencerto a computing device with memory, where they are stored in a file. (N JDracopoli, J L Haines, B R Korf, C C Morton, C E Seidman, J G Seidman, DT Moir, and D Smith, ed., Current Protocols in Human Genetics. New York;John Wiley and Sons, 1995), incorporated by reference.

To create STR allelic ladders, the most preferred embodiment entails PCRamplification of Pooled samples. This can be done by preparing DNA fromN (Preferably, N is between 2 and 200, depending on the application)individuals in equimolar 3 ng/ul concentrations. These Dias are thenpooled. After dilution, each PCR template contains contains 48 ng of DNAin an 18 ul volume, and is included in a standard 50 ul PCR Containing2.5 units of Amplitaq Gold, 1.25 ul of each primer, 200 uM dNTP's, and2.5 mM MgCl₂. This mixture is then PCR amplified with its STR primers(one labeled) on a thermocycler (e.g., with an MJ Research PTC-100, use30 cycles of 94° C. for 1.25′, 55° C. for 1′, and 72° C. for 1′). Sizeseparation of the PCR products on an ABI sequencer includes internallane size standards (GS500 ROX-labeled 50 bp sizing ladder,Perkin-Elmer, Foster City, Calif.; 20 bp MapMarkers sizing ladder,BioVentures, Murfreesboro, Tenn.). Files are similarly recorded fromthis experiment.

In an alternative preferred embodiment, multiplexed SNP data isgenerated using size standards with standard protocols. Typically, eachsize bin in the electrophoretic signal corresponds to one marker orpolymorphism. Presence in the size bin of a signal of sufficientstrength and correct color indicates the presence of an allele; absenceof the signal indicates allele absence. Signal size and color establishthe allele, while signal strength determines the amount of DNA present(if any). (N J Dracopoli, J L Haines, B R Korf, C C Morton, C E Seidman,J G Seidman, D T Moir, and D Smith, ed.; Current Protocols in HumanGenetics. New York: John Wiley and Sons, 1995), incorporated byreference.

In another alternative preferred embodiment, differential display datais generated, preferably as follows: Making cDNA. The mRNA differentialdisplay reverse transcription-polymerase chain reaction (DDRT-PCR) isperformed using reagents supplied in a RNAimage” kit (GeneHunter Corp,Nashville, Tenn.). RNA duplicates from two tissue Samples arereverse-transcribed using oligo (dT) primers and HMV reversetranscriptase (RTase). For a 20 ul reaction, adding (in order) 9.4 ul ofH₂O, 4 ul of 5× reaction buffer, 1.6 ul of 250 uM each dNTPs, 2 ul of 2uM one oligo (dT) primer (H-T11M: M is G, C, or A), and 2 ul of 0.1ug/ul DNA-free total RNA. The RT reactions are run on a thermocycler(e.g., MJ Research, 65° C. for 5 min, 37° C. for 60 min, and 75° C. for5 min). MMLV RTase (1 ul, 100 units) are added in the reaction afterincubation for 10 min at 37° C.: A control is included without addingRTase.

Amplification and labeling. For a 20 ul PCR reaction, add 9.2 ul of H₂O,4 ul of 10× PCR buffer, 1.6 ul of 25 uM each dNTPs, 2 ul of 2 uM H-T11Moligo (dT) primer, 2 ul of 2 uM of arbitrary primer (AP), 2 ul ofcorresponding H-T11M reverse transcribed cDNA, and 0.2 ul (1 unit) ofAmpliTaq DNA polymerase (Perkin Elmer, Norwalk, Conn.). Each of thethree different H-T11M PCR primers are labeled with its own spectrallydistinct fluorescent dye, such as FAM, HEX, and NED. The cDNAs arerandomly amplified by low-stringency PCR (40 cycles with temperature at94° C. for 15 sec, 40° C. for 2 min, and 72° C. for 2 min) in an MJRPCT/100 thermocycler. A final extension is performed at 72° C. for 10min. Samples (without added RTase or cDNA) are simultaneously tested ascontrols. Multiple primer sets can be used. For example, 24 sets ofprimers (8 AP×3 H-T11M) are used in each kit; using 10 kits forscreening differentially expressed cDNA tags produces 240 reactions pertissue.Size separation. Size standards in another dye are then added to theamplified labeled products, and then size separated on a manual orautomated sequencing gel (or capillary) instrument. Differential displaydata generation protocols have been well described (N J Dracopoli, J LHaines, B R Korf, C C Morton, C E Seidman, J G Seidman, D T Moir, and DSmith, ed., Current Protocols in Human Genetics. New York: John Wileyand Sons, 1995), incorporated by reference.

There are other alternative preferred embodiments for generating DNAfragment data whose assay includes a size separation, such asAmplification Refractory Mutation System (ARMS), Single-StrandConformation Polymorphism (SSCP), Restriction Fragment LengthPolymorphism (RFLP), and Amplified Fragment Length Polymorphism (AFLP).These have been enumerated, with associated protocols (N J Dracopoli, JL Haines, B R Korf, C C Morton, C E Seidman, J G Seidman, D T Moir, andD Smith, ed., Current Protocols in Human Genetics. New York: John Wileyand Sons, 1995; ABI/377 and ABI/310 GeneScan Software and OperationManuals, PE Biosystems, Foster City, Calif.), incorporated by reference.

Extracting Profiles

Once DNA fragment sizing data have been generated, the data are thenanalyzed to characterize the and quantity of the component fragments.

Referring to, FIG. 1, Step 1 is for acquiring the data.

The process begins by reading in the generated data from their nativefile formats, as defined by the DNA sequencer manufacturer. Let n be themuter of lanes or capillaries, and m be the number of frequencyacquisition channels. Capillary machines typically produce files whereeach file represents either all m channels of one capillary, or, onechannel of one capillary. Gel-based instruments typically produce fileswhere one file represents either all m channels of the entire image, orone channel of one image. Intermediate cases, with the number ofchannels per file between 1 and m, can occur as well.

Once the signals have been read in, capillary input data signals arepreferably stored in an n×m structure of one dimensional arrays in thememory of a computing device. This structure contains the signalprofiles, with each array element corresponding to one channel of onecapillary. Gel data are preferably stored as m two dimensional dataarrays, one for each acquisition frequency.

The computer software preferably integrates with current sequencer andCE technology. It preferably has two manufacturer-independent inputmodules: one for sequencer gel data (e.g., PE Biosystems ABI/377,Molecular Dynamics Fluorimager), and one for CE data (e.g., PEBiosystems ABI/310, SpectraMedix SCE/9600). These modules are extensibleand flexible, and preferably handle any known sequencer or CE data incurrent (or future) file formats.

Referring to FIG. 1, Step 2 is for processing the signal.

In this step, basic signal processing is done, such as baseline removalor filtering (e.g., smoothing) the data.

In the preferred embodiment for one dimensional signals, baselineremoval is done using a sliding window technique. Within each window(of, say, 10 to 250 pixels, depending on the average number of pixelsamples per base pair units), a minimum value is identified. Usingoverlapping windows, a cubic spline is fit through the minimum points,creating a new baseline function that describes the local minima in eachwindow neighborhood. To remove the baseline, this new baseline functionis subtracted away from the original function.

In the preferred embodiment for two dimensional images, the baseline isremoved as follows. A local neighborhood overlapping tiling is imposedon the image, and minimum values identified. Create a baseline surfacefrom these local minima, and subtract this baseline surface from theimage to remove the baseline from the image.

In the preferred embodiment, filtering and other smoothing is done usingconvolution. A convolution kernel (such as a gaussian or a binomialfunction) is applied across the one dimensional capillary signal, or thetwo dimensional scanned gel image. The radius of smoothing depends onnumber of pixels per base pair unit—denser sampling requires lesssmoothing. However, with overly dense sampling, the data size can bereduced by filtering out redundant points.

Referring to FIG. 1, Step 3 is for separating the colors.

A key element of fluorescent genetic analysis is separating thefluorescent dye signals. In the current art, this is done by;

-   (1) Performing a dye standard calibration experiment using known    dyes, often in separate lanes. A (dye color vs. frequency detection)    classification matrix C is known directly from which dye is used.    Each column of C contains a “1” in the row corresponding to the    known color, with all other entries set to “0”.-   (2) Measuring the signals at separate fluorescent detection    frequencies. These frequencies correspond to the “filter set” of the    sequencer. For each pure dye peak, the signals that are measured    across all the detection frequencies reveal the “bleedthrough”    pattern of one dye color into its neighboring frequencies. Each    pattern is normalized, and stored as a column in a data matrix D.    The j^(th) column of D is the “spectral bleedthrough” system    response to the impulse input function represented in the j^(th)    column of C.-   (3) The relationship between C and D is described by the linear    response dye calibration matrix M as:

D=M×C

To determine the unknown dye calibration matrix M (or its inverse matrixM⁻¹) from the data, apply matrix division, e.g., using singular valuedecomposition (SVD), to the matrices C and D. For example, this SVDoperation is built into the MATLAB programming language (The MathWorks,Inc., Natick, Mass.), and has been described (W H Press, S A Teukolsky,W T Vetterling, and B P Flannery, Numerical Recipes in C: The Art ofScientific Computing, Second Edition. Cambridge: Cambridge UniversityPress, 1992), incorporated by reference.

-   (4) Thereafter, the spectral overlap is deconvolved on new unknown    data D′ to recover the original dye colors C′. This is done by    computing:

C′=M ⁻¹ ×D′

While the current art enables the color separation step, it is notideal. Slight variations in the gel (e.g., thickness, composition,temperature, chemistry) and the detection unit (e.g., laser, CCD,optics) can contribute to larger variations in fluorescent response. Anoperator may encounter the following problems:

-   -   Calibrating the correction matrix M⁻¹ on one gel run does not        necessarily model the “spectral bleedthrough” pattern accurately        on future runs.    -   With 96 (or other multi-) capillary electrophoresis, each of the        capillaries forms its own gel system, whose different properties        may necessitate a separate calibration matrix.    -   Accurate dye calibration is technically demanding, labor        intensive, time consuming, and expensive. Moreover, it can        introduce considerable error into the system, particularly when        the manual procedure is not carried out correctly. In such        cases, the correction is imperfect and artifacts can enter the        system.        Such color “bleedthrough” (also termed “crosstalk” or “pullup”)        artifacts can severely compromise the utility of the acquired        data. In some cases, the gels must be rerun. Often scientific        personnel waste considerable time examining highly uninformative        data.

In one preferred embodiment, the color matrix is calibrated directlyfrom the data, without recourse to separate calibration runs. Thisbleedthrough artifact removal is done using computer algorithms for thecalibration, rather than manually conducting additional calibrationexperiments. This can be done using methods developed for DNA sequenceanalysis based on general data clustering. While such clustering methodsrequire relatively large amounts of data, they can be effective (WHuang, Z Yin, D Fuhrmann, D States, and L Thomas Jr, “A method todetermine the filter matrix in four-dye fluorescence-based DNAsequencing,” Electrophoresis, vol. 18, no. 1, pp. 23-5, 1997; Z Yin, JSeverin, M Giddings, W Huang, M Westphall, and L Smith, “Automaticmatrix determination in four aye fluorescence-based DNA sequencing,”Electrophoresis, vol. 17, no. 6, pp. 1143-50, 1996), incorporated byreference.

It would be desirable to use the least amount of most certain data whendetermining a color matrix for separating dye colors. Note that in thematrix relation “D=M×C”, the data columns D are known from experiment.When a calibration run is used, the 0/1 classification columns C areknown; from these known values, the unknown M can be computed. However,when there is no calibration run, the classification matrix C must bedynamically determined from the data in order to compute M. This can bedone manually by a user identifying peaks, or automatically by thegeneral clustering embodiment. However, there is a more refined andnovel approach to automatically classifying certain data peaks to theircorrect color.

The most preferred embodiment finds a (C, D) matrix pair using minimaldata. Thus, M (and M⁻¹) can be determined, and the rest of the new datacolor separated. This embodiment exploits a key physical fact aboutspectral emission curves: they are unimodal Referring to FIG. 2, suchcurves monotonically rise up to a maximum value, and then monotonicallydecrease from that maximum. Therefore, when a peak it generated from asingle dye color, spectrally sampled across a range of frequencies, andthe intensities plotted as a function of frequency, the plotted curvedemonstrates unimodal behavior—the curve rises up to a maximumintensity, and then decreases again. This physical “unimodality”constraint on single Color peaks can serve as a useful choice functionfor automatically (and intelligently) choosing peak data for theclassification matrix C.

The function space equivalent of a unimodal function has a very usefulproperty. Suppose that m different frequency channels are sampled. Thatis, there are m frequencies x=(x₁, x₂, . . . x_(m)) sampled in thespectral domain. An equivalent representation of the m-point functionf:m→

is as an m-vector v=<x₁, x₂, x_(m)> in the vector space

. Now, select an appropriate norm on this vector-space (say L₁ or L₂),and normalize the vector v relative to its length, forming

$w = {\frac{v}{v}.}$

In an L₁ normalization, w lives on a flat simplex (in the all-positivesimplex facet). In L₂, w lives on a corresponding curved surface.

The unimodality constraint on x imposes additional geometricalconstraints-on w. Because x is unimodal, note that

x₁≤x₂≤ . . . ≤x_(k),

and that

x_(k)≥x_(k+1)≥ . . . ≥x_(m),

where k is the index of the maximum value of x.

These inequality constraints determine the exact subfacet in which wmust reside on the simplex facet. Consider the case of three spectralsampling points, where the three-dimensional vector <x, y, z>∈

, and suppose that the unimodal constraint is x≥y≥z, corresponding tothe first dye color. Referring to FIG. 3, the locations x, y, and zdesignate the unit locations on the axes in

(i.e., at <1,0,0>, <0,1,0>, and <0,0,1>, respectively). Here, the x≥yconstraint produces one region (horizontal shading), the y≥z constraintanother (vertical shading), and their intersection corresponds to thefull constraint (crosshatch shading). Good calibration points for use ascolumns in a {C, D} matrix pair will cluster (white circle) around thedye's actual sampling point ratio vector (black cross). Conversely, poorcandidate calibration points will tend to lie outside the cluster, andcan be rejected. This geometry in the function space permits selectionof only the most consistent calibration data.

The procedure starts by gathering likely unimodal data (e.g., those withhighest intensities) for a given observed color k. After normalization,these candidate calibration data {w_(i)} will cluster within the samesubfacet of a flat (m−1)-dimensional facet; this facet is theall-positive face of the m-dimensional simplex. This geometricconstraint follows directly from the physical unimodality constraint ofpure spectral curves. An entire class of simple and affective clutteringalgorithms are based around exploiting this geometric constraint.

In one preferred embodiment, choose as cluster center point wo, the meanvector location of {w_(i)}; vector wo also lies on the simplex subfacet(FIG. 3). Then, a small inner product <w₁, w₀> value tends to indicateclose proximity of w_(i) and w₀. Taking a small set (e.g.,

) of the closest vectors w_(i) near w₀ selects good calibration datapoints, all pre-classified to color k. To determine M⁻¹, for each colork, take at most s such clustering points (w_(i)), and use thecorresponding (v_(i)) as columns in D, where v_(i) is normalized withrespect to the maximum element in w_(i). Form a vector u that is allzeros, except for a 1 in the k^(th) entry; place s copies of u ascolumns in matrix C. This produces the required calibration matrices Cand D, from which M and its inverse are immediately computed using SVDor another matrix inversion algorithm.

In the most preferred color separation algorithm, the criterion for peakselection is based on minimizing the spectral width of the peaks. A goodmeasure of peak width is the variance across the spectral samplingfrequency points. The variance calculation can take into account thesampling frequencies actually used, if desired. In this procedure, thevariances of candidate peaks for a given color frequency are computed.Those peaks having the smallest spectral variance indicate the bestcalibration points. In an alternative implementation, a best fit of theobserved frequency curve with a known frequency curve (e.g., via acorrelation or inner product maximization) can indicate the best pointsto use.

The method applies not only to raw data, but also to data that has beenpreviously color separated by other (possibly inaccurate) colorcorrection matrices. This is because the unimodality constraintgenerally applies to such data.

A useful feature of the unimodality approach is that the model canautomatically select good calibration peaks, even with very sparse data.This is because the function space geometry effectively constrains theclustering geometry. Such sparse-data clustering algorithms areparticularly useful with Capillary data, where one capillary may onlyhave 1 or 2 useful calibration peaks corresponding to a given color.Despite this data limitation, the method easily finds these peaks, andeffectively separates the colors. Such automation enables customizationof the dye separation matrix to each capillary on each run, whichvirtually eliminates the bleedthrough artifact.

It is useful to have a quality score that measures how well the computedmatrix correction actually corrects the data. This can be done bycomparing the expected vs. observed results; this comparison can becomputed either the separated or unseparated domain. Using theComputer-selected calibration data vectors D, if the computed correctionmatrix M⁻¹ is correct, then the 0/1 calibration matrix of column vectorsC will be recomputed exactly from the data matrix D:

C′=M ⁻¹ ×D

where, theoretically, C′=C. Measuring the deviation between C and C′measures how well the correction worked. One straightforward deviationmeasure sums the normed L₂ deviations between each column of C′ and itscorresponding column in C. When the deviation is too great (e.g., due tofailed PCR amplification), a matrix based on a more confident (manual orautomatic) calibration can be used.Referring to FIG. 1, Step 4 is for removing the primers.

When the separated DNA fragments are formed from labeled PCR primers, ituseful to remove the intense primer signal prior to initiatingquantitative analyses. There are many standard signal processing methodsfor removing a singularity from a signal, or a row of such large peaksfrom an image.

For one dimensional data, first detect the primer signal. This can bedone, in one embodiment, by smoothing (i.e., low pass filtering) thesignal to focus on broad variations, and then fitting the largest peak(i.e., the primer peak) to an appropriate function, such as a Gaussian.Determining the variance of the peak from the function fit provides thedomain interval on which the primer peak should be removed. On thisdomain, it is most preferable to set the values on this interval to anappropriate background value (e.g., zero, if background subtracted, oran average of neighboring values outside the interval). Alternatively,one can crop the interval from the signal.

For two dimensional data, a projection of the pixel data onto thevertical axis of DNA separation finds the row of peaks. Determine thespread of this peak signal by curve fitting (e.g., with a Gaussian).Remove the primer peaks by either cropping the signal from the image, orsetting the values in that domain to zero (or some other appropriatebackground value).

Referring to FIG. 1, Step 5 is for tracking-the sizes.

In the preferred embodiment, size standards are run in the same lane asthe sample data, and are labeled with a label different from the labelused for the sample data. The task is to find these size standardspeaks, confirm which peaks represent good size standard data, and thenalign the observed size standards peaks with the expected sizes of thesize standards. This process creates a mapping between the size standardpeak data. sampled in the pixel domain, and the known sizes (say, inbase pair or molecular weight units) of each peak.

For one dimensional data, there are no lateral lanes to help determinewhich of the teaks observed in the size standard signal represent gooddata. Therefore, a preferred procedure uses prior information about thesize standards (e.g., the size) to ensure a proper matching of datapeaks to known sizes. In the most preferred embodiment, use thefollowing steps:

-   0. Find some good candidate peaks to get started.-   1. Identify the best peaks to use by filtering poor candidates. This    can be done by performing quality checks (e.g., for the height,    width, or peak fit) on the candidate peaks.-   2. Match the expected peak locations to the observed data peaks.    This is done by applying a “zipper match” algorithm to the best    candidate data peaks and the expected sizes. This matching uses    local extension to align the peaks with sizes, and includes the    following steps:    -   2a. Match the boundary, that is, a subset of smallest size data        peaks or largest size data peaks. The algorithm can be rerun        several times, shifting the boundary data peaks or the expected        sizes. The best boundary shift can be found by heuristic        minimization. One Good heuristic assesses the uniformity (e.g.,        by minimizing the variance) of the ratios across matching local        intervals of the expected size standard difference to the        observed data peak difference.    -   2b. Fix a tolerance interval that includes unity (different        tolerance intervals can be tried).    -   2c. Starting with the (possibly truncated) boundaries of the        expected sizes and observed peaks aligned, extend this boundary.    -   2d. Compute the ratio p of the difference between the next        expected peak size and the current expected peak size, to the        difference between the next observed peak size and the current        observed peak size.    -   2e. Compute the ratio q of the difference between the current        expected peak size and the previous expected peak size, to the        difference between the currant observed peak size and the        previous observed peak size.    -   2f. If the ratio (p/q) is greater than the tolerance interval's        greatest value, then the observed data peak falls short. Reject        the observed data peak, advance to the next observed data peak,        and continue.    -   2g. If the ratio (p/q) is less than the tolerance interval's        least value, then the expected data size falls short. Reject the        expected data size, advance to the next expected data size, and        continue.    -   2h. If the ratio (p/q) lies within the tolerance interval, then        the expected data size is well matched to the observed data        peak. Accept the size and the peak, record their match, and        advance to the next size and peak.-   3. If desired, fill in any missing data peaks by interpolation with    expected-sizes from the zipper matching result.-   4. Compute a quality score for the matching result. Useful scores    include:    -   4a. Fitting the matching of the expected sizes (in base pair, or        other expected unit) with the peak sizes (in pixels, or other        observed data unit). The relationship is monotonic and typically        slowly varying, so a deviation from a fitted function (e.g.,        linear, cubic, or Southern mobility relationship) works well.    -   4b. Another quality score is the number of size mismatches,        adjusted by the boundary shift.-   5. Report the peak positions, matched with size standards. It is    also useful to include the quality score of the match.

For two dimensional gel data, proceed by tracking on the color-separatedsize standard image. To track simultaneously both the sizes and lanes,first focus on a boundary row (e.g., the top row) of size standards.Model the row geometry (e.g., curvature, size, location) and each of thepeaks (e.g., height, shape, angle). Then use this pattern as a set ofgeometrical constraints for analyzing the next row. Use global search tofind the next row, and more localized search to identify the peaks;refine the local peak locations using center-of-mass calculations.Continue this process iteratively until the size standard data arecompletely analyzed, and a final grid is produced. It is possible to usea prior analysis of the loading/run pattern as calibration data in orderto speed up subsequent tracking. The output of this lane/size trackingis a mapping between the expected (lane, base pair size) coordinategrid, and the observed (x pixel, y pixel) gel data image coordinategrid.

The quality of the tracking result can be scored by comparing theexpected grid locations with the observed data peak locations—straightdata grids indicate a high quality result, whereas large distortionsindicate a lower quality result. One useful quality score measures thecurvature of the observed peaks by forming the sum of local neighbordistances, and normalizing relative to the neighbor distances in astraight idealized grid. This grid is preferably formed from the knownlane loading pattern, the known size standard positions, and observedgrid data (e.g., mean boundary positions).

With two dimensional gel image data f(x,y,z), z denoting the colorplane, there also is the step of extracting lane profiles from the gelimage to form a set of one dimensional profiles. The lane trackingresult is used in this step. For each lane, form the mapping fromy-pixel values to the (x,y) pixel locations indicated by the sizestandard location found for the lane. Create a spline function (e.g.,cubic) that interpolates through the mapped v:y→x pixel pairs to allother (x,y) pixel values. Then, extract a one dimensional profile (foreach color k), where the domain values down the lane are the desiredsequential pixels y, and the range values are computed from the functionf(v(x),y,z). This procedure is done for each data image plane z,extracting either color separated data (by applying the correctionmatrix) or unseparated data.

Referring to FIG. 1, Step 6 is for extracting the profiles.

The signal profile of a capillary or lane is f(x_(pixel),z), where zdenotes the label color (or channel). This function is available forcapillary data available after processing Step 2, and for gel data afterextracting Profiles in Step 5.

The pixel sampling of the electrophoresis distorts the sizing of the rawdata. The size tracking result from Step 5 provides as set of(x_(pixel); x_(size)) pairs. Use these matched pairs to form thecoordinate transformation μ:x_(pixel)→x_(size). Combining the functionsf and μ via a double interpolation to form:

f0μ⁻¹(x_(size),z),

a new function that describes the signal as a function of size standardunits (instead of pixel units), preferably in base pair size units. Ifcolor separation has not yet been done, the correction matrix is appliedat this time. These transformed (capillary or gel data) profiles arethen preferably stored in an n×k data structure of sized profiles, wheren is the number of pixel samples, and k is the number. of dye colors.

Note that the f_(pixel)0μ⁻¹ (x_(size), z) transformation which maps

x_(size)→f_(size)(x_(size))

can be usefully understood as the commutative diagram:

The transformation is preferably implemented by performing a doubleinterpolation. In MATLAB, this operation can be readily computed usingthe ‘interp1’ function where:

YI=interp1(X,Y,XI, method)

interpolates to find YI, the interpolated values of the-underlyingfunction Y at the points in the vector XI; the vector X specifies thepoints at which the data Y is given.

With ‘interp1’, interpolate the desired size domain points into computedpixels using the monotonic bijection between expected sizes and observedpeak pixels:

pixels=interp1(sizes, peaks, domain, ‘spline’)

Cubic spline interpolation is done by setting the method to ‘spline’.Then, interpolate the function f on the desired size domain points(i.e., the computed pixels) using the monotonic bijection f(i) betweenindices {i} and data values f{i}:

indices=[1:length(data)];

profile=interp1(indices, data, pixels, ‘spline’);

Computing Profile completes the commutative diagram, and produces thenew function f_(size)(x_(size)).

Referring to FIG. 4, visualizations are shown for the results ofprocessing Steps 1-6.

The computer software preferably extracts and analyzes dataautomatically, without human intervention (if so desired). The softwareseparates colors using a matrix, which is either precalibrated orcreated from the data, depending on which module is used. The softwaretracks lanes and size standards on two-dimensional gel dataautomatically by mapping the expected two dimensional lane/size grid tothe observed size standard data; on one-dimensional CE data, sizetracking is done separately for each capillary. The user interfacesoftware makes manual retracking, zooming, and single-click access tothe chemistry panel (or sample data used on a run) available to the userthroughout.

A chain of custody is maintained in that the user preferably cannot moveto the next step without saving results (manually or automatically) fromthe current step. Changes are saved, not discarded; moreover, thesoftware records these changes incrementally, so that the audit trailcannot be lost by early program termination.

Backtracking capability and flexibility are preferably included in thesoftware. For example, should fully automated lane tracking fail due tolow-quality data, the user can choose to: (a) edit the results, and havethe program re-track, (b) edit the results without re-tracking, (c)manually track the lanes and sizes, or (d) reject the low=quality gel. A“revert” operation provides a universal Undo operation for automaticallyrolling back major Processing steps.

Data Scoring

Referring to FIG. 5, Step 7 is for deriving an allelic ladder.

It is useful to have a set of reference peaks that (a) correspond to theactual locations of DNA molecules on the gel, (b) have known lengths (inbase pair units), and (c) cover a large part of the sizing window. Thisreference set can be developed for any fragment sizing genetic assay;without loss of generality, the preferred embodiment is described forSTR genotyping.

Construct a partial allelic ladder by PCR amplifying a pool of DNAsamples. This allelic ladder, and optionally a known sample, arepreferably loaded into the electrophoretic system as separate signals(i.e., in different lanes or colors). After the gel is run, referring toFIG. 1, Steps 1-6 are performed to initially process the signals. Then,preferably in size coordinates (rather than in pixel coordinates), apeak finding algorithm locates and sizes the peaks in these two laneswith respect their in-lane size standards. The known sample's sized.peaks are then compared with the similarly sized peaks found in theladder lane. Following this comparison, the known DNA lengths are thenassigned to the allelic ladder. The DNA length labels of these DNAlengths are then propagated to the unlabeled ladder peaks.

A preferred peak labeling procedure is:

1. In the ladder signal, find the domain positions (i.e., the sizesrelative to the size standards) of the allelic peaks.

2. Perform a relational labeling from the known signal to the allelicladder signal, as follows. Find the peaks of the known signal, andassign to them their known sizes (in allelic sizes units, such as basepairs). Then, match (in size standard units) these known allele peaks tothe peaks in the ladder signal having corresponding size. Designatethese ladder peaks with the allelic size labels of the known peaks.

3. Extend these allelic label assignments from the allelic ladder peaksdesignated with known labels to the rest of the ladder peaks.

3a. When the expected size pattern of the alleles on the allelic ladderis known (as with previously characterized forensic ladders), a robustmethod for assigning size labels to data peaks uses a standardopen/closed search set artificial intelligence (AI) algorithm. Startfrom the most confident data (i.e., the knowns) as the closed set, withall remaining peaks in the open set. Each cycle, (i) select an openladder allele that is nearest in allele size to a closed ladder allele,(ii) predict the calibrated size (from the size standards) of thisallele using interpolation relative to the closed peaks, (iii) find theallelic ladder peak whose calibrated size is closest to the predictedsize in (ii), and (iv) if the quality score (based on size deviation) isgood enough, move this candidate open peak to the closed set. Ontermination, only the most confident ladder peaks in the observed datahave been matched to expected ladder alleles. See (N J Nilsson,Principles of Artificial Intelligence. Palo Alto, Calif.: TiogaPublishing Co 1980), incorporated by reference.

3b. When expected ladder alleles are not available (e.g., withuncharacterized pooled DNA samples), the ladder pattern alone (peakspacing or peak heights) usually contains sufficient information fordesignating the labeling. Start with the most confident data—knownallele peaks, or tallest ladder peaks. Then, locally extend the allelelabels to the size peaks. Since there is more uncertainty without aknown ladder pattern, more search is useful. One preferred method is to(i) find the tallest (most confident) peak, (ii) match size with nearest(bp size) allele, (iii) locally align iteratively for smaller and largerpeaks, (iv) assign a quality score to the alignment, (v) repeat thePreceding three steps, but shifted one or two sizes up or down, and (vi)select the alignment with the highest quality score.

4. Optionally, fill in any missing ladder peaks with interpolation byreference to past ladder data.

5. Fill in the virtual points on the ladder. These are expected ladderalleles that are believed to exist in the population, but that are notrepresented as peaks in the allelic ladder. This is done byinterpolation.

6. Return the results assigning allele labels (e.g., in base pair orother integer units) to data ladder peaks (e.g., in size standardunits). It is preferable to report the Quality score of the assignment.

In nonforensic STR applications, where previously characterized laddersare generally unavailable, pooled alleles of a given marker can be usedas a reference ladder. This novel approach can help eliminate the sizebinning problem that plagues microsatellite and other STR geneticmethods. It is preferable to use the same allelic ladder across multipleruns. The ladder can be comprised of either pooled DNAs that are PCRamplified together, or post-PCR amplified products that are then pooledtogether. It may be desirable to visually inspect the ladders. In onepreferred embodiment, previously uncharacterized ladders are checked thefirst time that they are encountered, with a human editor identifyingthe best peaks to use and matching them against their expected sizes.Recording Quantitative peak data for a ladder can enable the use ofquantitative computer-based matching of reference ladders and newunknown ladders (e.g., for peak and size alignment) by correlation orother inner product methods.

When the interpolation and extrapolation have finished, the allele sizesthat were actually present as peaks in the ladder data, as well asdesired allele-sizes that were not present as peaks, have all beenallocated to size positions.

SNP ladders can be developed in a similar way to STR ladders. Withmultiplexed SNPs, it is useful to run allelic sizing ladders comprisedof actual SNP data for the markers of interest in a signal. Then,comparison of the unknown SNP sample data with the SNP allelic laddercan remove uncertainty regarding the correct size (hence, allele)assignment. The same reasnoning applies to any size-based assay forwhich a (partial or complete) ladder of candidate solutions can bedeveloped.

With gel electrophoresis, it is preferable to include on the gel atleast one allelic ladder for every marker used. For example, one lanecan be dedicated to ladders for the marker panel used in the otherlanes; this lane can be loaded in duplicate on the gel. It is alsodesirable to include at least one known reference allele lane on everygel; this signal can be one or mora positive PCR controls. The advantageof duplicate control lanes (for ladders and reference controls) is thatwhen there is a PCR, loading, detection, or other failure in one lane,the signal in the other lane can be used. Moreover, a comparison of thetwo (or more) signals can suggest when such a failure may have occurred.

With capillary electrophoresis, it is similarly preferable to use ladderand reference controls.

-   -   With single capillary systems (such as the ABI/310), these        controls should be run at some point during the lifetime of the        capillary. Preferably, the controls should be run as often as        the temporal variation in the sizing system (i.e., differential        sample and size standard migration) warrant.    -   With a multiple capillary instrument (e.g., ABI/3700, MegaBACE,        etc.) each capillary can form its own electrophoretic        .separation system. in the most preferred embodiment, the        allelic ladder, known reference samples, and any other capillary        controls (for fragment sizing, color separation, etc.) are run        at least once for every capillary, with the calibration results        then applied specifically to that capillary. For example,        consider the case of using one panel comprised of a set of        markers, with this panel applied to a set of s samples, run out        on an n-capillary instrument that achieves r sequential runs per        capillary with acceptable sizing fidedility. Suppose (for        high-throughput studies) that, approximately, n×r≤s. Then, one        of the runs (e.g., run number 1 or r/2) should have allelic        ladders in all n capillaries, and another run (e.g., run number        2 or 1+r/2) should have allele references in all n capillaries.

Generating data containing these allelic ladder sizing controls, andanalyzing the data as described in this step, reduce run-to-run sizingvariation. Such reduction in gels or capillaries is crucial fatachieving reproducible sizing assays, Variable run Conditions (e.g.,temperature, gel consistency, formamide concentration, sample purity,concentration, capillary length, gel thickness, etc.) can inducedifferential sizing between the PCR products in the sample and theinternal size standards. These allelic ladder comparison methods helpcorrect for such variation.

Referring to FIG. 5, Step 8 is for transforming coordinates.

In the preferred embodiment, size comparisons used in the analysis areperformed in the allelic ladder size coordinate system, rather than inthe size standard coordinate system. While the latter approach is alsoworkable, the former method has the advantages that the reference systemis comprised of DNA bands size calibrated to actual integer DNA moleculelengths, rather than to artibrary fractional molecular weight sizingunits. Using integer DNA lengths (e.g., true base pairs) is closer tothe physical reality, and can simplify the data analysis, logicalinference, communication with the user, quality assurance, and errorchecking.

In one preferred embodiment, the signals are kept in size standardunits, but comparisons are made in the allelic ladder frame ofreference. For example, in comparing sample peaks with allelic ladderreference peaks, the sample peaks can first be interpolated into adomain based on the allelic ladder peak sizes, prior to comparing thesample peaks with any other reference sizes.

A ladder-based peak sizing method can establish a direct connectionbetween observed peak sizes and actual DNA fragment lengths (possibly upto a constant shift). Transforming data sizes into DNA lengths overcomessize binning problems. For example, rounding number to the nearestinteger in DNA length (i.e., ladder) coordinates permits the assigningof consistent labels to each peak; this consistency is not achieved whenround fractional peak size estimates. Moreover, the peak's deviationfrom the integer ladder provides a quality measure for how consistentlythe peaks are sizing on a particular size separation run.

In the most preferred embodiment, the sample signals are transformedfrom size standard units into allelic ladder DNA length units. Step 6provides the sized signal profile of a lane (or capillary):

f0μ⁻¹(x_(size)).

Step 7 provides a characterized allelic ladder that matches the allelesof the ladder in size standard units with corresponding DNA moleculelengths in base pairs (or other suitable integer-spaced units). Thispairing defines a coordinate transformation for that ladder:

λ:x_(size)→x_(length).

Combining the function f0μ⁻¹ together with the function λ (this can bedone using a double interpolation) forms:

f0μ⁻¹0λ⁻¹(x_(length)),

which represents the signal intensity f in terms of DNA lengthx_(length). The relevant mathematics and computer implementations aredetailed above in Step 6, extracting the profiles. For each marker, withn lanes or capillaries, and k colors, these transformed profiles can bepreferably stored in an n×k data structure (e.g., in memory, or in afile).

This procedure provides a method for analyzing a nucleic acid sample.The steps included: (a) forming labeled DNA sample fragments from anucleic acid sample; (b) size separating and detecting said samplefragments to form a sample signal; (c) forming labeled DNA ladderfragments corresponding to molecular lengths; (d) size separating anddetecting said ladder fragments to form a ladder signal; (e)transforming the sample signal into length coordinates using the laddersignal; which in turn permit (f) analyzing the nucleic acid samplesignal in length coordinates, as follows.

Referring to FIG. 5, Step 9 is for quantitating signal peaks.

In addition to sizing DNA peaks, it is also useful to quantitate therelative amount of DNA present. To do this accurately requires takingaccount of band overlain. Few systems currently perform this bandoverlap analysis; those that do (e.g., Cybergenetics' TrueAllelesoftware), use a combinatorial approach that increases analysis timegreatly as the number of adjacent peaks increases. This combinatorialcost cat impede analyses of large allelic ladders or of differentialdisplay data. In the preferred embodiment, as described herein, the DNAquantitation step for resolving band overlap should computationallyscale (e.g., at linear or small polynomial cost) with the number ofbands analyzed.

In DNA length coordinates, as developed in Step 8, a peak has a naturalshape that stems from band broadening as it migrates down the gel orcapillary. (For accurate peak quantitation, DNA length coordinates arepreferable to size standard coordinates, and size standard-coordinatesare preferable to pixel scan coordinates.) Centered at location x_(k),this peak shape can be described as a Normal function

${\varphi_{k}^{L}(x)} \propto e^{{- \frac{1}{2}}{(\frac{x - x_{k}}{\sigma})}^{2}}$

on the leading edge, and a Cauchy function

${\varphi_{k}^{R}(x)} \propto \frac{1}{1 + \left( \frac{x - x_{k}}{\alpha} \right)^{2}}$

on the receding edge; the functions shame the same height h at x_(k).Band broadening implies that as x_(k) increases, the width parameters σand α will increase, the height h will decrease, while the area remainsconstant. This is confirmed by fitting peek spread data as a function ofDNA length with a third order exponential function.

Using the changing band shapes as a set of basis functions, write thetrace f (in DNA length coordinates) as:

${f(x)} = {\sum\limits_{k = 1}^{n}{c_{k} \cdot {\varphi_{k}(x)}}}$

where basis function ϕ_(k) depends on <x_(k), σ_(k), α_(k), h_(k)>,i.e., the center position x_(k), Normal spread σ_(k), Cauchy spreadα_(k), and center height h_(k).

From the data f(x), solving for the coefficients {c_(k)} provides an,estimate of the DNA concentration at each peak. Less efficientapproaches (e.g., TrueAllele) currently solve this equation by aleast-squares search of the 4n coefficients to the data f, withcombinatorial computing time proportional to exp(4n). However, byexploiting the model functions ϕ_(k), together with function spacemathematics, this computing time can be reduced to roughly linear cost,proportional to n. See (F Riesz and B Sz.-Nagy, Functional Analysis. NewYork: Frederick Ungar Publishing Co., 1952), incorporated by reference.

Normalize the basis functions ϕ_(k) so that

<ϕ_(k)(x), ϕ_(k)(x)>≡1

and note that the band overlap coefficients b_(k,j) can be numericallyestimated from the model functions as

<ϕ_(k)(x), ϕ_(j)(x)>≡b_(k,j)

Then observe that with initial estimates of x_(k) (assuming one peak perbp size k), rewrite the inner product <f, ϕ_(j)> as:

${\langle{f,\varphi_{j}}\rangle} = {{\left( {\sum\limits_{k = 1}^{n}{c_{k} \cdot \varphi_{k}}} \right) \cdot \varphi_{j}} = {{\sum\limits_{k = 1}^{\kappa}{c_{k} \cdot \left( {\varphi_{k} \cdot \varphi_{j}} \right)}} = {\sum\limits_{k = 1}^{n}{c_{k} \cdot b_{k,j}}}}}$

Setting the data derived values {d_(j)=<f, ϕ_(j)>}, and usingappropriate vector-matrix notation, yields the relation:

d=c×B

With sparse peak data that has little band overlap (e.g.,tetranucleotide repeat data) B=I_(n) (the identity matrix), and c≈dimmediately yields the DNA concentrations at every peak k. Moregenerally, B is largely an identity matrix (i.e., primarily zerosoff-diagonal), but has a few email near-diagonal elements (the bandoverlaps) added in. Solve for c from the observed data vector d and theband overlap matrix B using a very fast matrix inversion algorithm(e.g., SVD) that exploits the sparse nature of the local overlapcoefficients.

In the most preferred embodiment, the overlap matrix B is used torapidly estimate the DNA concentrations (c_(k)) at the peaks. This isdone by computationally exploiting the functional analysis. Since

d=c×B,

rewriting yields:

d×B ⁻¹ =c×B(×B ⁻¹)=c

or

c=d×B ⁻¹;

and so the DNA concentrations c can be estimated immediately from thevalues {d_(j)=<f, ϕ_(j)>} derived from data signal f once the ϕ_(j) andtheir overlap integrals B=<ϕ_(j), ϕ_(k)> are known.

In the preferred embodiment, estimate the ϕ_(j) basis functions andtheir overlaps. Each basis function ϕ_(k) depends on <x_(k), σ_(k),α_(k), h_(k)>. To reduce this four dimensional search to one dimension,proceed as follows.

-   -   x_(k) can be estimated from the peak center.    -   h_(k) is not a factor, since each ϕ_(k) is normalized to        integrate to 1.    -   σ_(k) and α_(k), are empirically observed to be in a fixed ratio        to one another (at least in local neighborhoods), due to the        relatively constant peak shapes.        Therefore, a computationally efficient approach is to perform a        one dimensional search for σ_(k) and α_(k), keeping their ratio        fixed. Then, if so desired, perform a quick local two        dimensional search in the solution neighborhood to refine the        values of σ_(k) and α_(k).

To find the peak width parameters σ_(k) and α_(k), first fix a value ofσ in a local neighborhood; in 1D search, set α to a fixed proportion ofσ; in 2D search, set the value of a as well. For each observed peak, setthe center x_(k) of ϕ_(k) to the peak's center location. Followingnormalization, the basis functions (ϕ_(k)) are determined. Then, for j'sin the neighborhood, compute d_(j)=<f, ϕ_(j)>, or:

d=<f, ϕ>

Determine B by numerical (or closed form) integration of the basisfunctions ϕ_(j) and ϕ_(k):

B=[

_(k), ϕ_(j)

]

Compute B⁻¹ by inverting B, or:

B ⁻¹=[

ϕ_(k), ϕ_(j)

]⁻¹

Note that terms far from the diagonal are essentially zero. These localneighborhood calculations of d and B⁻¹ are rapid, and are used in theinner loop of the following minimization procedure.

Minimize the difference between the observed data signal f and theexpected model function:

${f(x)} = {\sum\limits_{k = 1}^{n}{c_{k} \cdot {\varphi_{k}\left( {x - x_{k}} \right)}}}$

by substituting the c_(k) computed from:

c=d×B ⁻¹

Minimization (e.g., using an L₂ norm) of the expression:

${{{f(x)} - {\sum\limits_{k = 1}^{n}{\left( {d \times B^{- 1}} \right)_{k} \cdot {\varphi_{k}\left( {x - x_{k}} \right)}}}}}_{1}$

in the neighborhood as σ (and possibly α) vary, finds the best estimateof the widths of the basis function. These widths can be modeledlocally, interpolated by fitting with a cubic exponential function, andthen used to generate appropriate basis functions across the full rangeof sizes.

Application of these modeled basis functions to the data function fproduces robust estimates of the DNA concentration c_(k) at each DNApeak k, since, after convergence,

c=d×B ⁻¹.

In the current art, workers use the heights or areas of the data peaks.Because of the extensive peak shape modeling done here, it is preferableto use the heights (i.e., c_(k)) or areas (determined from c_(k) and theφ_(k) peak shape, e.g., by closed form evaluation) of thecomputationally modeled peaks.

The match-based quantitation described herein is very well suited tohighly multiplexed data, such as SNPs, differential display, DNA arrays,and pooled samples. This is because the inner product operations (andlocal band overlap corrections) can be applied accurately and inconstant time, regardless of the number of data peaks in the signal.

It is useful to assess the quality of the individual peak quantitationresults. This assessment can be done by comparing the modeled datafunction

${\hat{f}(x)} = {\sum\limits_{k = 1}^{n}{c_{k} \cdot {\varphi_{k}(x)}}}$

with the observed data signal f. A normalized L₂ deviation (computed,via a correlation) between expected and observed based on theminimization search function above can be used as the comparisonmeasure.Referring to FIG. 5, Step 10 is for analyzing the data, preferably bycalling the alleles.

In the preferred embodiment,, allelic (or other) DNA ladder data isavailable, and the alleles can be called by matching sample peaksrelative to the ladder peaks. This match operation is fast, reliable,very accurate, and accounts for inter-gel or inter-capillary variations.Depending on the application, a window (typically ±0.5 bp) is set arounda ladder peak of calibrated DNA length. When a sample peak (preferablyin ladder coordinates) lies within this bp size window, it can bereliably designated as having the length of that ladder peak. Zero sizedeviation between the centers of sample and ladder peaks indicates aperfect match; greater deviations indicate a less reliable match.

In an alternative preferred embodiment, the data profiles can be storedin size coordinates, and brought into length coordinates Only whenneeded. This, is done by retrieving a sample's size coordinate profile,as well as and the ladder peaks. The sample's peaks are found in sizecoordinates, and then interpolated in into length coordinates using theladder size to peak mapping, as described previously. The sample's peaksare then in length coordinates, and can be rounded to the nearestinteger, or matched against integers representing valid alleledesignations. The deviation from such integers (e.g., the fractionalcomponent) can be used a measure of quality, accuracy, or concern.

In other embodiments, found in the current art's fragment sizingsoftware, the peak sizes (e.g., the alleles in genotyping applications)are analyzed in a size standard coordinate system, and-laddercalibration data is not Used by the computer. This analysis entails abottom-up collection and, comparison of data from many samples to formdata-directed bins. These bin distributions are then used to designatesample Peak size. However, the distribution variance across multipleelectrophoresis runs on different samples can be quite high. When thesebins have overlapping sizes, allele size designation becomes suiteuncertain.

In the preferred embodiment, both DNA length and amount are, used inscoring the data. With STR genotyping, for example, Where there can bemultiple peeks, the DNA concentration modeled peak height or area) isused as measure of confidence in the observed peak with a heterozygotegenotype, two peaks are expected, and with a homozygote, only one. Sinceother peaks may be present from mixtures, PCR artifacts, or other DNAsources, the analysis will focus on the higher concentration peaks,particularly those peaks residing in allelic ladder windows. Most of thepeak signal mass should be concentrated in the most confident peaks:high DNA amount, and in a ladder window. When this is not the case,confidence in the data is lower.

Once the peaks have been quantitated at known DNA lengths, the data canbe further analyzed. Such analyses include stutter deconvolution,relative amplification correction/allele calling, allele frequencydetermination (from pooled samples), differential display comparisons,and mutation detection. In genotyping applications, allele callingshould be done on the signals only after corrections (e.g., for stutteror relative amplification) have been made. See (M W Perlin, “Method andsystem for genotyping,” U.S. Pat. No. 5,541,067, Jul. 30, 1996; M WPerlin, “Method and system for genotyping,” U.S. Pat. No. 5,580,728,Dec. 3, 1996; S-K Ng, “Automating computational molecular genetics:solving the microsatellite genotyping problem,” Carnegie MellonUniversity, Doctoral dissertation CMU-CS-98-105, Jan. 23, 1998),incorporated by reference.

In one preferred embodiment, allele calling on quantitated correcteddata is done by:

-   1. Finding the largest peak (area or height), and ensuring that is    within a window on the allelic ladder.-   2. Removing all peaks from the signal that either (a) have a DNA    length that is not in a window of the allelic ladder, or (b) have a    DNA amount that is not within some minimum percentage of the largest    peak.-   3. Calling the alleles by matching the DNA lengths of each sample    peak to the DNA sizing windows on the allelic ladder.-   4. Applying rules to check for possible data artifacts. Typical    rules are described below.-   5. Computing a quality score, particularly for those data apparently    free of data artifacts. Various quality score components are    discussed below.-   6. Recording the designated alleles, and the quality of the result.    Referring to FIG. 6, the results of ladder processing, peak    quantitation and allele calling can be visualized.

There are many “junk-detecting” rules that can be designed and appliedto data. Critical to all such rules is the ability to compare observedmeasures against expected behaviors. By modeling the steps of theprocessing, computing appropriate quality scores at each step, andcomparing these observed data features with normative results, theinvention enables a precise computer diagnosis of problems with datasignals and their quality. Some example rules include:

-   -   Noise only. Using measures (such as Wilcoxon's signed-rank        statistic) to test for randomness, a computer program can decide        that an experiment produced primarily noise (B W Lindgren,        Statistical Theory, Fourth Edition. New York, N.Y.: Chapman &        Hall, 1993), incorporated by reference.    -   Low signal. The Peaks should have heights over a certain        (user-defined) minimum threshold. When a profile's highest peak        does not reach that threshold, flag the problem.    -   High signal. The peaks should not be over a certain        (user-defined) maximum threshold. When a profile's highest peak        does exceeds that threshold, flag the problem.    -   Peak dispersion. The designated peaks should comprise a certain        percentage of the total signal. If a profile's designated peaks        contain less than that percentage, fire the rule.    -   Algorithm conflict. When multiple scoring algorithms are used,        they should agree on the scoring results. Report any conflicts.    -   Relative-height. For some applications (e.g., forensic STR        analysis) the relative peak heights should be within a        predefined ratio of each other. Indicate when a genotype has a        second peak with a relative height that is too low.    -   Third peak. One (homozygote) or two peaks (heterozygote) should        contain most of the DNA signal. Indicate when a genotype has a        third peak that contains too much DNA signal.    -   Off ladder. All the allele peaks should be close to their ladder        peaks. When one (or more) of the alleles are too far away from        their ladder peak, inform the user.    -   Uncorrelated. When there are two allele peaks, their centers        should deviate similarly from their respective ladder peaks.        When a genotype has deviations that do not correlate (i.e.,        their difference exceeds some threshold), flag the problem.    -   Control check. The calls computed for a control experiment        should be consistent with the known results. When they are not,        flag the result.        Note that some of these rules (e.g., “off-ladder”) make use of        allelic ladders, when they are present.

To better understand the decision support role of each numerical qualityscore (such as those used in rules), and their decision thresholds, itis useful to collect the scores for a large set of data and analyzethem. Collection proceeds by recording the set of scores for eachapplicable genotype result, and indexing the genotypes (say, by sample,gel, and locus). Consider each such numerical score as a mapping fromthe genotypes to

. Classify the genotypes by how they were scored; one preferredclassification includes unscorable (e.g., noise), low quality (e.g.,rule firings), correctly called good data (e.g., human agreement), andincorrectly called good data (e.g., human agreement). The resultclassification can be obtained by comparing the computer calls (and rulefirings) with human edited (or otherwise independently scored) data. Onsome useful subset of genotypes (e.g., each locus examined separately),the numerical quality scores can be collated and histogrammed for eachresult classification; this produces a set of distributions, one foreach classification. Comparing (numerically, visually, statistically,etc.) the different distribution profiles for the differentclassifications provides insight into the utility and scaling of anumerical quality score, permitting the derivation of decisionthresholds or probability models. In a preferred embodiment, a decisionthreshold is statistically determined by distinguishing two scoredistributions (e.g., correct and incorrect results according to adetermined sensitivity or specificity. In another preferred embodiment,linear or logistic regression is used to model the probability of anaccurate allele call. These thresholds or probabilities can be displayedto a user for enhanced confidence or decision support.

It is useful to compute a quality score on the good data; one criterionfor “good data” is that the experiment does not trigger any “junk”detecting rules. Quality scores enable the ranking of experiment resultsfor selective review, as well as the computation of accuracyprobabilities. Many possible quality scores can be developed, dependingon the application and the available data. In all cases, there is somekind of comparison between expected behavior and an observedresult—small deviations indicate high quality, whereas large deviationssuggest lower quality. Example quality adores include:

-   -   Domain measures. Whet ladder data is present, it is possible to        compute the deviation between a candidate allele peak center and        its associated ladder peak center. When ladder data is not        present, a similar comparison can be made relative to a        precalibrated bin center; rather then a ladder peak center. One        useful score is the maximum (over the scored alleles) of the        absolute value of the deviations. When this number is close to        zero, the confidence is high. As it increases to 0.5 bp, the        result becomes less confident.    -   Range measures. A sizing data result pertains to an particular        number of peaks; any additional peaks represent a dispersion of        the signal mass away from the hypothesized score. Ideally, all        the signal mass should be present in the called peaks, which can        be measured by the peak centering ratio=(called peaks)/(all        peaks). When this ratio is near unity, the confidence is high.        As it decreases to zero, the result becomes less confident.    -   Product measures. A product of a domain measure with a range        measure can be a sensitive indicator of quality. In one        preferred embodiment, let a result scale with 1 as the highest        quality, and 0 as the lowest quality. Rescale the domain measure        above to map the score interval [0,0.5] to the quality interval        [1,0], with all scores above 0.5 set to 0. Rescale the range        measure above to map the score interval [0,1] to the quality        interval [1,0]. Then form the product of these two rescaled        scores, so that the result lies in the interval [0,1].    -   Function measures. Once all data corrections have been made        (e.g., for PCR stutter, relative amplification, peak shapes,        peak sizes, peak center deviations, band overlap, etc.) on the        fully quantitated and modeled signal, the inverse of these        corrections can be applied to the size result to resynthesize        the signal. Comparison of the resynthesized signal with the data        signal provides a measure of how completely the analysis modeled        the data—the residual deviation can measure lack of confidence        in the result. One such comparison is a correlation between the        synthesized and data signals. This correlation can be computed        so that small values indicate confidence (e.g., using a        normalized L₂ deviation), or so that larger values indicate        confidence (e.g., using a normalized inner product or        statistical correlation measure).        The development of some quality scores has been described (M W        Perlin, “Method and, system for genotyping,” U.S. Pat. No.        5,876,933, Mar. 2, 1999; S-K Ng, “Automating computational        molecular genetics: solving the microsatellite genotyping        problem,” Carnegie Mellon University, Doctoral dissertation        CMU-CS-98-105, Jan. 23, 1998; B Pálsson, F Pálsson, M Perlin, H        Gubjartsson, K Stefánsson, and J Gulcher, “Using quality        measures to facilitate allele calling in high-throughput        genotyping,” Genome Research, vol. 9, no. 10, pp. 1002-1012,        1999), incorporated by reference.

Probabilities can be computed from quality scores. The individualcomponents of the quality scores generally lie on a numerical scale.Histogramming and multivariate regression analysis can provide insightinto the distribution of the correctly scored “success” data populationrelative to the distribution of the incorrectly scored “failure” datapopulation along each of these measures. A logit transformation of thisdichotomous outcome variable is useful, and provides [0,1] bounds forprobability estimates; and the use of a binomial (rather than normal)distribution for the error analysis. By applying standard logisticregression, the key underlying independent variables can be elucidated.This logistic regression analysis can help determine the thresholds usedin the “junk-detection” rules, and, for each experiment, can compute theprobability of an accurate score from the observed variables. Forexample, the domain and range measures used above can be used as twoindependent variables, with the outcome being success or failure in thecomputer correctly calling the allele results. Logistic regression onthese variables with a preanalyzed data set can be used to construct acorrectness probability far allele calls on further data sets. See (AAgresti, Categorical Data Analysis. New York, N.Y.: John Wiley & Sons,1990; D W Hosmer and S Lemeshow, Applied Logistic Regression. New York:John Wiley & Sons, 1989; Statistical Analysis Software, SAS Institute,Cary, N.C.; Statistical Package for the Social Sciences, SPSS Inc.,Chicago, Ill.), incorporated by reference.

Data Review

Once the automated, computer scoring has completed, it is often usefulto have a person assess the results. Since reviewing perfectly, scoreddata is an inessential step, the data review should optimally focus onreviewing the least confident (but scorable) data. The outcome of thisfocused review is typically a decision for each experiment's result:accept the result, modify the result, reject the result, plan to redothe experiment in the laboratory, and so on.

In order to focus the data review, it is helpful to have aprioritization. In the preferred embodiment, the quality score oraccuracy probability is used to rank the experiments. The review mayarrange the suspect experiments in different subsets, e.g., grouped bymarker, sample, equipment, personnel, time, location, etc. By reviewingthe worst data first (i.e., rule firings, least quality scores), theuser is enabled to focus on evaluation and repair of only the suspectdata. Not reviewing highly confident data frees up human operator timefor other process tasks. Moreover, not reviewing unscorable data issimilarly useful.

Using, these methods, low quality data can be identified and classified.At the single datum level, individual results can be examined and betterunderstood. At the data set level, problematic loci, samples or runs canhe identified by examining percentages of, outcome indicators (e.g.,,rule firings, low quality scores, calling errors) relative to datastratifications. For example, the number of miscalls of known samples(arranged by locus and gel run) can identify Problematic data trendsearly on. This information can provide useful quality control feedbackto the laboratory, which can improve the overall quality of future datageneration.

Referring to FIG. 7, it is useful to present the results graphically.Visualizations of the experiments (e.g., gel lane tracking or colorseparation) and the signal traces help the user understand potentialproblems with the data, and possibilities for their correction (both forthe particular experiment, as well as for the overall data generationprocess). In the preferred embodiment, multiple graphical visualizationsfor inspecting and reviewing genotype data include interfaces for:

-   -   inspecting and annotating the raw gel data;    -   inspecting and editing the automated lane tracking and sizing;    -   assessing data quality and marker size windows;    -   reviewing and editing automatically called alleles (by quality        scare priority, or other user-selected data rankings),        preferably with all data and inferences visually presented in        the context of the marker's allelic ladder;    -   visually examining the data signal, preferably overlaid upon an        allelic ladder;    -   A flexible examination of bleedthrough (or “pull-up”) artifact;    -   reviewing computed DNA quantitations and sizes, preferably in        the context of the allelic ladder. A more detailed window        explores the quantitation results.    -   showing the allele calls;    -   providing access to a mote detailed textual window presenting a        summery of useful allele calling information in tabular form;    -   focusing on the allelic ladder data signals, when available; and    -   displaying selected multiple lane signals graphically in one        view.        Such multiple representations visually show the quality of the        data, and provide diverse, focused insights into the data and        their processing. Such graphical interfaces are generally used        on only a small fraction (e.g., less than 58-10%) of the data,        since most high-throughput users do not care to revisit        high-quality allele calls on high-quality data.

Referring to FIG. 8, it is also useful to present the results textually.Textual information can facilitate a more detailed understanding of aparticular experiment. It is helpful for the user to have rapid accessto both graphical and text presentation formats. The preferredembodiment provides information on allele designations, laddercomparisons, molecular weight sizes, and genotype quality. This displayalso gives explanations of rules that may have fired. The table includespeak size, peak height, peak area, and peak fit quality. The display isextensible, permitting modifications to the display that can draw fromthe expected and observed values that are computed for each scoredgenotype.

The preferred embodiment provides very flexible navigation between thegraphical and textual view. A gel display permits viewing Of anycombination of gel image, size standard peak centers, or lane/sizetracking grid, as well as editing capability. An allele call displayprovides navigation facilities (e.g., buttons and menus) for rapidlyselecting samples and loci, including zoom, slider, overlay, andrelational options. With a single click, the user can examine the peakand ladder. sizes of any designated allele. The interface also automatesmany of the mundane display aspects (e.g., signal resizing, qualityprioritization), thereby enabling rapid user navigation.

For maximum flexibility, the preferred embodiment reports results tocomputers and people in multiple ways. Preferred modalities include:

-   -   Providing a flexible format (e.g., tabbed text files, or SQL        queries) for seamless interaction with database, spreadsheet,        laboratory information management system (LIMS), text editing,        and other computer software.    -   Providing such a flexible format for input information.    -   Providing such a flexible format for output results.    -   Providing such a flexible format for logging, audit, and error        messages.    -   Recording rule fittings and quality scores (along with allele        calls) in result files (e.g., tabbed text) in such a flexible        format. The rule firing representation is extensible and        backwardly compatible, so that it is easy to add more rules over        time as more cases are observed.    -   When a window focused on a genotype, displaying the fired rules        (if any), thereby setting the context for why particular low        quality data were rejected.    -   Listing or explaining the fired rules in plain English in a        textual window for the human operator.    -   Listing the designations, allelic ladder information, molecular        weight deviations (between the designation and its allelic        ladder), quality sabre, or a table of peak sizes, areas,        heights, and qualities.

Different data artifacts are typically associated With their ownspecific visualizations. For example, a signal containingsize-designated standard peaks do not overlay properly on a differentsignals size-designated standard peaks; referring to FIG. 9, thisimproper overlay can be visualized by superimposing the signals indifferent colors. For most data artifacts, human data reviewerspainstakingly (a) contract an appropriate visual representation (e.g.,an overlay of specific signals) to (b) confirm or disconfirm thepresence of the artifact. The more efficient approach of the preferredembodiment is to reverse this order. First (b), have the computerautomatically determine (from rules or other quality scores) what dataartifacts are present, and then (a) automatically Constructvisualizations that are customized to each specific artifact. Thesedata-directed visualizations can be opened up automatically, ordisplayed Upon request.

Referring to FIGS. 4, 6, and 9, to efficiently develop a library of suchdata artifact customized displays, the invention includes a graphicallanguage interpreter. An element of the display library is a messagetemplate that operationally characterizes how to display the artifact.This template is filled in by applying it to specific data. A messagefor the interpreter includes a set of n-tuples that describe the type ofdisplay, the source of data, a size range, fluorescent dye, or otheruseful display information. The data source preferably refers to the n×krepresentation of data signals. This interpreter then dispatches an thedisplay type (e.g., data signal, vertical line, fitted curve,annotation, etc.) and possible subtype (e.g., main data, size ladder,allelic ladder, known control, negative control, etc.) of each tuple tosupply additional display information (e.g., drawing color, line style,line thickness, etc.) that is specific to that display type. Executionof the set of messages constructs and presents the customized display tothe user for its corresponding data artifact.

Analysis System

Referring to FIG. 10, the present invention pertains to a system foranalyzing a nucleic acid sample 102, as specified above. The systemcomprises a means 104 for farming labeled DNA sample fragments 106 froma nucleic acid sample. The system further comprises means 108for sizeseparating and detecting said sample fragments to forma sample signal110, said separating and detecting Means in communication with thesample fragments. The system further comprises means 112 for forminglabeled DNA ladder fragments 114 corresponding to molecular lengths. Thesystem further comprises means 116 for size separating and detectingsaid ladder fragments to form a ladder signal 118, said separating anddetecting means in communication with the ladder fragments. The systemfurther comprises means 120 for transforming the sample signal intolength coordinates 122 using the ladder signal, said transforming meansin communication with the signals. The system further comprises means124 for analyzing the nucleic acid sample signal in length coordinates,said analyzing means in communication With the transforming means.

Special Applications

For many applications, it is useful to generate sizing results that arecomparable across different DNA sequencer instruments. This platforminteroperability is essential, for example, when creating a referenceDNA. database (e.g., for human identification in forensics, ormulti-laboratory genetic analyses) with data comparisons from multiplelaboratories. Should different instruments at different laboratoriesyield different sizing results for PCR products relative to the size,standards, then such DNA reference resources become almost useless. Thepresent invention uses sizing ladders based on such sample fragments(e.g., allelic ladders, with or without known reference samples) inorder to assure that sizing results are based on sample fragment sizes.Specifically, said sizing results are not based solely onelectrophoretic migration of DNA fragments relative to size standardmolecules; such relative migration can be highly variable acrossdifferent instruments, and even on the same instrument when differentrun conditions are employed. The present invention overcomes thislimitation in the current art, uses novel computer-based scoring ofproperly calibrated data to provide automated sizing of DNA fragments,and enables true interoperability between different sizing instruments.

Via this interoperability, the invention provides a method for producinga nucleic acid analysis. Steps include: (a) analyzing a first nucleicacid sample on a first size separation instrument to form a firstsignal; (b) analyzing a second nucleic acid sample on a second sizeseparation instrument to form a second signal; (c) comparing the firstsignal with the second signal in a computing device with memory to forma comparison; and (d) producing a nucleic acid analysis of the twosamples from the comparison that is independent of the size separationinstruments used.

In forensic applications, it useful to match a sized-based genotype(e.g., STR or SNP) of a sample against a reference genotype. In makingsuch forensic match comparisons, it is preferable to have a computerscoring program designate alleles relative to an allelic ladder, ratherthan using an “exact” size relative to sizing standards. This is becauseof the highly variable differential migration of the sizing standardsrelative to the PCR products. An allelic ladder (whetherprecharacterized or dynamically characterized) provides standardreference DNA molecule lengths. By comparing and reporting sample DNAfragment lengths relative to these constant reference DNA lengths (andnot to variable size standard comigration units), it is possible toreliably match genotypes of a given sample with those of a referencesample. This comparison and reporting is enabled by the presentinvention. Moreover, this reliability is essential for humanidentification applications where near zero error is required. Incriminal comparisons, the sample DNA Profile may be from a stain at acrime scene, whereas the reference DNA profile is from a suspect or apreexisting DNA database, with the goal of establishing a connectionbetween a suspect and a crime scene. In civil applications, the sampleand reference DNA profiles may be used to determine degree ofrelatedness, as in a paternity case. (I W Evett and B S Weir,Interpreting DNA Evidence: Statistical Genetics for Forensic Scientists.Sunderland, Mass.: Sinner Assoc., 1998), incorporated by reference.

PCR artifacts can complicate DNA fragment sizing. With STRs, PCPstutter, relative amplification (also termed “preferentialamplification”), and constant bands are common artifacts. Usingcomputer-based scoring methods, these artifacts can be resolved bystutter deconvolution, adjusting relative peak heights based on fragmentsize, detecting and suppressing artifactual bands, and otherquantitative methods. (M W Perlin, “Method and system for genotyping,”U.S. Pat. No. 5,541,067; Jul. 30, 1996; M W Perlin, “Method and systemfor genotyping,” U.S. Pat. No. 5,580,728, Dec. 3, 1996; S-K Ng,“Automating computational molecular genetics: solving the microsatellitegenotyping problem,” Carnegie Mellon University, Doctoral dissertationCMU-CS-98-105, Jan. 23, 1998), incorporated by reference.

Differential display is a sensitive assay for relative gene expression.In automated computer data analysis of differential display geneexpression profiles, the objective is to identify size bins at whichthere is a demonstrable difference between the DNA amounts present inthe two profiles. In a preferred embodiment, to compare:

-   -   Transform the expression profiles so that each resides in a        uniform DNA siting coordinate system, referring to FIG. 1,        Step 6. When the pixel representation is highly reproducible        (e.g., when running serially on a single capillary), this        transforming step can be omitted.    -   Identify the peaks in each profile, and record their x domain        (DNA size) and y height, (estimated DNA concentration) values.    -   Compare the domain values of the peaks in each profile, and form        a set of matching paired peaks (one from each profile). Retain        those paired peaks with close x values. In the preferred        embodiment, the difference between the x values is less than or        equal to ½ base pair.    -   Retain the cross-profile peak pairs having relatively close y        value ratios. Compute the standard deviation of the ratio of the        y values for peak pairs, and remove those peak pairs whose        y-value. ratios lie outside a certain standard deviation range        (e.g., one standard deviation).    -   Rescale the profile ranges so that they approximately        superimpose. Normalize the first profile by dividing by its        maximum y value. Model the y-value ratios polynomially (e.g.,        linearly) as a function of x. Normalize the second profile by        dividing by the modeled y-value ratio function.    -   Refine the peak matching by a two dimensional comparison that        requires the matching peak pairs to satisfy both a certain        x-tolerance (e.g., 0.5 bp) and certain y tolerance. (e.g., 5%        relative height). When the 1D peak matching does not produce        spurious results, this refining step may be omitted.    -   Use the matched peaks as a peak ladder, and transform the        coordinates of one profile into the coordinates of the second        profile, referring to FIG. 1, Step 8. When the peaks are very        close (preferably less than 0.25 bp deviation), this        transforming step may be omitted.    -   Compare the superimposed profiles, end identify peak pairs whose        calculated y-value differences or ratios are outliers, relative        to the paired y value standard deviations. These outlying peak        pairs represent possibly significant up- or down-regulation of        gene expression.    -   Report the identified outlying peak pairs. This can be done as        two lists (one for up-regulation, and one for down-regulation).        Within each list, rank the results by, the y value deviation.        The results of such an Analysis are shown in FIG. 11, which        demonstrates no evident variation in the repeated running of one        differential display sample.

Software Description

Referring to FIG. 12, the data scoring software is preferably maintainedin aversion control system. After testing has completed, program changesare committed. For each supported platform (Macintosh, windows, Unix,etc.), an automated assembly computer retrieves the updated software andsupporting data from the version control server, preferably over acomputer network. Then, this computer compiles, the software and datafor run time operation on the specific target platform, and followsautomation scripts to assemble the materials into an installer process,preferably for CD-ROMs or Internet installation (e.g., MindVision's VISEfor the Macintosh, or InstallShield for Windows). Hypertextdocumentation for the software is maintained, updated, and compiled in across-platform format, such as bookmarked PDF files, preferably usingautomated document authoring software (e.g., Adobe's FrameMakerprogram). For each supported platform, the application program,associated data, and documentation are included on the web or diskinstallers. Users preferably download or update their software using theweb internet installer; alternatively, a disk installer can be usedlocally on a computer or local area network.

The automated scoring software maintains an audit trail of its actions,operation, and decisions as it processes the data according the steps ofFIGS. 1 and 5. The data formats are kept operational for specificplatforms by automatically checking and transforming certain files(e.g., text representations) prior to the program's using these files onthat Platform. The data format permits multiple instrument dataacquisition runs (e.g., gel or capillary loadings) to be processedtogether in a single computer analysis run. These software featuresreduce human intervention and error.

User feedback on the program's operation is preferably entered at adesignated web site onto a reporting HTML form that includes expectedand observed program behavior: Via CGI scripts, these reports areautomatically logged onto a database (e.g., FileMaker Pro on a Macintoshover a local area network), and appropriate personnel may beautomatically notified via email of potential problems.

Business Considerations

The automated quality maintenance system described herein for generatingand analyzing DNA fragment data has a nonobvious business model. It isdesirable for groups to generate and analyze their own data, since theusers of the data often have the greatest incentive to maintain dataquality. Moreover, people involved in the data generation task requirecontinual feedback from the data analysis results. By generating datathat includes proper calibration reference standards (e.g., internalsize standards, allelic ladders, reference traces, etc.), high qualitydata can be automatically analyzed in ways that lower quality datacannot.

The cost of manual scoring of data is quite high. The Preferred businessmodel includes a spreadsheet that permits an end-user to calculate theirlabor costs. Referring to FIG. 13, a prospective or current customer canenter parameters related to the cost of labor, the data generationthroughput, and how effectively the labor force analyzes the generateddata. From these factors, the spreadsheet can calculate the total laborcost, as well as the Libor cost per experiment, for the specificcustomer requirements. This calculating tool is preferably madeavailable to customers in spreadsheet form (e.g., as aplatform-independent Microsoft Excel 98 spreadsheet) as a commuter filethat can be distributed on disk, via email attachment, or downloadablefrom the internet. In a complementary embodiment, the spreadsheetfunctionality can be provided as an interative form on a web page. Bybetter understanding the labor costs of data scoring, a customer candevelop insight into the role of quality data and automatedcomputer-based data scoring as a direct replacement for labor.

Error is another significant cost in the human scoring of genetic data.In genetics, error severely compromises the power of linkage and otherassociation measures, so that despite considerable research time, cost,and effort, genes are less likely to be discovered. In forensics, errorcan lead to incorrect identification of suspects, failed convictions ofcriminals, or failure in exonerating the innocent. Thus, methods thatreduce error or improve data quality confer significant advantages tothe user.

Since automated computer-based scoring of DNA sizing data is equivalentto human labor that performs the same task, the pricing model of suchautomated scoring is preferably bayed on usage. A fee is charged forevery genotype scored. This fee preferably includes components forintellectual property, software support, and user customizations. Withvery high levels of usage, some components such as user customizationcan be reduced in line with the associated business expense reduction.

For better market penetration, pricing levels should be set near orbelow the equivalent human labor cost. The result is an automatedcomputer-based scoring process that is faster, better and cheaper thanthe equivalent human review of data. specifically, the computer-basedprocess can produce more consistent results with lower error, leading tomore productive use of the scored data.

Additional services (including user training, system setup, softwaremodifications, quality audits) are best charged separately. Preferably,there is a mandatory initial interaction with the customer group (forwhich the user pays) to train new users in the quality data generationand computer-based analysis process.

Quality assurance is an integral part of the process and the businessmodel. A quality entity (e.g., a corporation) can help support thisquality maintenance system (QMS). This entity can provide qualityassurance by spot checking the user's data generation or scoring of theDNA fragment size data. This quality assurance can be done by (a) theentity providing the user with samples (characterized by the entity) fordata generation or analysis, and then comparing the user's results withthe entity's results, (b) the user providing the entity with samples(characterized by the user) for data generation or analysis; and thenhaving the entity comparing the entity's results with the user'sresults, (c) a comparison of data results involving a third party, or(d) a double-blind comparison study.

The quality entity can use these quality assurance methods to conductquality audits for different sites, and disseminate the results (e.g.,by internet publication). Then, different data generation sites cancompete with one another for business based on their quality andcost-effectiveness. Beyond a certain critical mass, it would be highlydesirable for such DNA analysis sites to have certification provided bythe quality entity on the basis of such audits. Those sites with thehighest quality data and using the best automation software should bethe most competitive. The quality entity can also provide a service(e.g., at regular time intervals, say annually) for quality checks onsites that desire to achieve and maintain the best possible dataresults.

This procedure describes a method for generating revenue from computerscoring of genetic data. The steps include: (e) supplying a softwareprogram that automatically scores genetic data; (b) forming genetic datathat can be scored by the software program; (c) scoring the genetic datausing the software program to form a quantity of genetic data; and (d)generating a revenue from computer scoring of genetic data that isrelated to the quantity. Moreover, additional process steps include: (e)defining a labor cost of scoring the quantity of genetic data when notusing the software program; (f) providing a calculating mechanism forestimating the labor cost from the quantity; (g), determining the laborcost based on the quantity; and (h) establishing a price for using thesoftware program that is related to the labor cost.

Mixture Analysis

In forensic science, DNA samples are often derived from more than oneindividual. With the advent of quantitative analysis of STR data, thereis the possibility of computer-based analysis that can resolve thesedata. Specifically, there is a need to find or confirm the identity ofcomponent DNA profiles, as well as determine mixture ratios. In thepreferred embodiment, the quantitation of the DNA samples isaccomplished by performing Steps 1-6 of FIG. 1, and Steps 7-9 of FIG. 5.The accurate quantitation conducted in Step 9 enables accurate analysisof quantitative mixture data, and improves on the prior art that usesmodeled peak area or peak height (GeneScan Software, PE Biosystems,Foster City, Calif.) which provide potentially inaccurate estimates ofDNA concentration. The invention's quantitative mixture analysis is partof Step 10 of FIG. 5 in the case of DNA mixtures.

The present invention is distinguished over the prior art in that ituses a linear mathematical problem solving formulation that combinesinformation across loci, and can completely integrate this informationautomatically on a computing device. By inherently using all theinformation from all the loci in the formulation, robust solutions canbe achieved. The prior art uses manual or Bayesian methods on a perlocus basis that entail human intervention in generating or combiningpartial results. Such prior forensic mixture analysis methods have beendescribed (P Gill, R Sparkes, R Pinchin, T M Clayton, J P Whitaker, andJ Buckleton, “Interpreting simple STR mixtures using allele peak area,”Forensic Sci. Int., vol. 91, pp. 41-53, 1998 ; J W Evett, P Gill, and JA Lambert, “Taking account of peak areas when interpreting mixed DNAprofiles,” J. Forensic Sci., vol. 43, pp. 62-69, 1997; T M Clayton, J PWhitaker, R Sparkes, and R Gill, “Analysis and interpretation of mixedforensic stains using DNA STR profiling,” Forensic Sci. Int., vol. 91,pp. 55-70, 1998), incorporated by reference.

There are different formulations of the mixture problem. With a mixtureprofile derived from two individuals, problems include:

-   -   verifying the mixture and computing the mixture ratio, given the        profiles of both individuals;    -   determining the profile of one individual and the mixture ratio,        given the Profile of another individual; and    -   determining the profiles of both individuals and the mixture        ratio, given no other DNA profiles.        These problems can be similarly extended to a number of        individuals greater than two. The STR mixture method described        herein addresses all of these problem formulations.

In the PCR amplification of a mixture, the amount of each PCR productscales in rough proportion to relative weighting of each component DNAtemplate. This holds true whether the PCPs are done separately, orcombined in a multiplex reaction. Thus, if two DNA samples A and B arein a PCR mixture with relative concentrations weighted as wA and wB(0≤wA, wB≤1, wA+wB=1), their corresponding signal peaks after detectionwill generally have peak quantitations (height or area) showing roughlythe same proportion. Therefore, by observing the relative peakproportions, one can estimate the DNA mixture weighting. Note thatmixture weights and ratios are interchangeable, since the mixture weight

$\frac{\lbrack A\rbrack}{\lbrack A\rbrack + \lbrack B\rbrack}$

is in one-to-one correspondence with the mixture ration

$\frac{\lbrack A\rbrack}{\lbrack B\rbrack}.$

To mathematically represent the linear effect of the DNA sample weights(wA, wB, wC, . . . ), write the linear equation:

p=G×w,

where p is the pooled profile column vector, each column i of genotypematrix G represents the alleles in the genotype of individual i (takingallele values 0, 1, 2, with a total of 2 alleles), and w is the weightcolumn vector that reflects the relative proportions of template DNA orPCR product. To illustrate this coupling of DNA mixture weights withpredicted pool weights, if there are three individuals A, B, Crepresented in a mixture with weighting wA=0.5, wB=0.25, wC=0.25, and atone locus the genotypes are:

A has allele 7 and allele 2,

B has allele 1 and allele 3, and

C has allele 2 end allele 3,

then combining the vectors via the linear equations:

$\begin{bmatrix}{alleles} \\{in} \\{mixture}\end{bmatrix} = {\left\lbrack {{\begin{bmatrix}{alleles} \\{of} \\A\end{bmatrix}\begin{bmatrix}{alleles} \\{of} \\B\end{bmatrix}}\begin{bmatrix}{alleles} \\{of} \\C\end{bmatrix}} \right\rbrack \times \begin{bmatrix}{wA} \\{wB} \\{wC}\end{bmatrix}}$

and representing each allele as a position in a column vector, producesthe linear relationship,

$\begin{bmatrix}0.75 \\0.75 \\0.75\end{bmatrix} = {\left\lbrack {{\begin{bmatrix}1 \\1 \\0\end{bmatrix}\begin{bmatrix}1 \\0 \\1\end{bmatrix}}\begin{bmatrix}0 \\1 \\1\end{bmatrix}} \right\rbrack \times \begin{bmatrix}0.50 \\0.25 \\0.25\end{bmatrix}}$

Note that the sum of alleles in each allele column vector is 20normalized to equal two, the number of alleles present.

With multiple loci, the weight vector w is identical across all theloci, since that is the underlying mixture in the DNA template. Thiscoupling of loci can be represented in the linear equations by extendingthe column vectors p and G with more allele information for additionalloci. To illustrate this coupling of DNA mixture weights across loci,add a second locus to the three individuals above, where at loons twothe genotypes are:

A has allele 1 and allele 2,

B has allele 2 and allele 3, and

C has allele 3 and allele 4,

then combining the vectors via the partitioning:

$\begin{bmatrix}{{locus}\; 1} \\{mixture} \\{alleles} \\{-- -} \\{{locus}\; 2} \\{mixture} \\{alleles}\end{bmatrix} = {\left\lbrack {{\begin{bmatrix}{{locus}\; 1} \\{A^{\prime}s} \\{alleles} \\{-- -} \\{{locus}\; 2} \\{A^{\prime}s} \\{alleles}\end{bmatrix}\begin{bmatrix}{{locus}\; 1} \\{B^{\prime}s} \\{alleles} \\{-- -} \\{{locus}\; 2} \\{B^{\prime}s} \\{alleles}\end{bmatrix}}\begin{bmatrix}{{locus}\; 1} \\{C^{\prime}s} \\{alleles} \\{-- -} \\{{locus}\; 2} \\{C^{\prime}s} \\{alleles}\end{bmatrix}} \right\rbrack \times \begin{bmatrix}{wA} \\{wB} \\{wC}\end{bmatrix}}$

and representing each allele as a position in a column vector, produces:

$\begin{bmatrix}0.75 \\0.75 \\0.50 \\{-- -} \\0.50 \\0.75 \\0.50 \\0.25\end{bmatrix} = {\left\lbrack {{\begin{bmatrix}1 \\1 \\0 \\{-- -} \\1 \\1 \\0 \\0\end{bmatrix}\begin{bmatrix}1 \\0 \\1 \\{-- -} \\0 \\1 \\1 \\0\end{bmatrix}}\begin{bmatrix}0 \\1 \\1 \\{-- -} \\0 \\0 \\1 \\1\end{bmatrix}} \right\rbrack \times {\begin{bmatrix}0.50 \\0.25 \\0.25\end{bmatrix}.}}$

With multiple loci, there is more data and greater confidence inestimates computed from the linear equations.

More precisely, write the vector/matrix equation p=G×w for mixturecoupling (of individual and loci) as coupled linear equations thatinclude the relevant data:

${p_{ik} = {\sum\limits_{j}{g_{ijk}w_{j}}}},$

where for locus i, individual j, and allele k

-   -   p_(ik) is the pooled proportion in the observed mixture of locus        i, allele k;    -   g_(ijk) is the Genotype of individual j at locus i in allele k,        taking values 0 (no contribution), 1 (heterozyote or hemizygote        contribution), or 2 (homozygote contribution), though with        anomalous chromosomes other integer values are possible; and    -   w_(j) is the weighting in the mixture of individual j.

Given partial information about equation p=G×w, other elements can becomputed by solving the equation. Cases include:

-   -   When G and w are both known, then the data profile, p can be        predicted. This is useful in search algorithms.    -   When G and p are both known, then the weights w can be computed.        This is useful in confirming a suspected mixture, and in search        algorithms.    -   When p is known, inferences can be made about G and w, depending        on the prior information available (such as partial knowledge of        G). This is useful in human identification applications.        The DNA mixture is resolved in different ways, demanding on the        case.

Assume throughout that the mixture profile data vector p has beennormalized for each locus. That is, for each locus, let NumAlleles bethe number of alleles found in data for that locus (typicallyNumAlleles=2, one for each chromosome). For each allele element of thequantitation data within the locus, multiply by NumAlleles, and divideby the sum (over the alleles) of all the quantitation value's for thatlocus. Then, the sum of the normalized quantitation data is NumAlleles,as in the illustrative example above.

To resolve DNA mixtures, perform the steps: (a) obtain DNA profile datathat include a mixed sample; (b) represent the data in a linearequation; (c) derive a solution from the linear equation; and (d)resolve the DNA mixture from the solution. This procedure is illustratedin the following cases.

First consider the case where all the genotypes G and the mixture data pare known, and the mixture weights w need to be determined. This problemis resolved by solving the linear equations p=d×w for w using SVD orsome other matrix inversion method. Specifically, w can be estimated as:

w=G\p

using the SVD matrix division Operation in MATLAB.

Next consider the Cage of two individuals A and B where one of the twogenotypes (say, A) is known, the mixture weights w are known, and thequantitative mixture data profile p is available. Expand p=G×w in thiscase as:

p=wA·gA+wB·gB

where gA and gB are the genotype column vectors of individuals A and B,and wA and wB=(1−wA) are their mixture weights. Then, to resolve thegenotype, rewrite this equation as

gB=(p−wA·gA)/wB

and solve for gB by matrix operations. The computed gB is the normalizeddifference of the mixture profile minus A's genotype. The accuracy ofthe solution increases with the number of loci used, and the quality ofthe quantitative data.

Next consider the case of making inferences about the genotype matrix Gstarting from a mixture profile p. This case has utility in forensicscience. In one typical scenario, a stain from a crime scene may a DNAmixture from the victim and an unknown individual, the victim's DNA isavailable, and the investigator would like to connect the unknownindividual's DNA profile with a candidate perpetrator. This scenariotypically occurs in rape cases. The perpetrator may be a specificsuspect, or the investigator may wish to check the, unknown individual'sDNA profile against a DNA database Of possible candidates. If themixture weight wA were known, then the genotype gB could be computedimmediately from the matrix difference operation. of the precedingparagraph.

Since wA is not known, a workable approach is to search for the best win the [0,1] interval that satisfies additional constraints on theproblem, set wA equal to this best w, compute the genotype g(wA) as afunction of this optimized wA value, and Set gB=g(wA). A suitableconstraint is the prior knowledge of the form that possible solutiongenotype vectors g can take. It is known that solutions must have avalid genotype subvector at each locus (e.g., having alleles taking onvalues 0, 1 or 2, and summing to 2). This knowledge can be translatedinto a heuristic function of g(w) which evaluates each candidategenotype solution g against this criterion.

In the preferred embodiment, the heuristic is a function of w, theprofile p, and the known genotype gA. Since p and gA are fixed for anygiven problem, in this case the function depends only on, the variablew. For any given w in (0,1), compute g(w) as (p−w·gA)/(1−w). Then, ateach locus, compute and record the deviation dev_(locus)(g(w)). Thedev_(locus) function at one locus is defined as:

-   -   Assume the genotype comprises one allele. Compute the deviation        by finding, the index of the largest peak, and forming a vector        one allele that has the value 2 at this index and is 0        elsewhere. Let dev1 be the L₂ norm of the difference between        g(w) and one allele.    -   Assume the genotype comprises two alleles. Compute the deviation        by finding the index of the two largest peaks, and forming a        vector two allele that has the value 1 at each of these two        indices and is 0 elsewhere. Let dev2 be the L₂ norm of the        difference between g(w) and two allele.    -   Return the the lesser of the two deviations as minimum (dev1,        dev2).        To compute dev(g(w)), sum the component dev_(locus)(g(w)) at        each locus. That is, the heuristic function is the scalar value

${{dev}\left( {g(w)} \right)} = {\sum\limits_{loci}{{dev}_{locus}\left( {g(w)} \right)}}$

Appropriately optimize (e.g., minimize) this function over w in [0,1] tofind wB, and determine gB as g(wB). In one alternative preferredembodiment, the summation terms can be normalized to reflect alternativepreferred weightings of the loci or alleles. In a different alternativepreferred embodiment, various heuristic functions can be used thatreflect other reasonable constraints on the genotype vectors, as in (PGill, R Sparkes, R Pinchin, T M Clayton, J P Whitaker, and J Buckleton,“Interpreting simple STR mixtures. using allele peek area,” ForensicSci. Int., vol. 91, pp. 41-53, 1998), incorporated by reference.

Referring to Step 10 of FIG. 5, further quality assessment can beperformed on the computed STR profile derived from the mixture analysis.As described, rule checking can identify potentially anomalous allelecalls, particularly when peak quantities or sizes do not conform toexpectations. Quality measures can be computed on the genotypes, whichcan indicate problematic calls even when no rule has fired. Onepreferred quality score in mixture analysis is the deviation dev(gB) ofthe computed genotype. Low deviations indicate a good result, whereashigh scores suggest a poor result. These deviations are best interpretedwhen scaled relative to a set of calibration data which have beenclassified as correct or incorrect. Of particular utility is thepartitioning of the deviations by locus, using the locus deviationfunction dev_(locus) (gB). When a locus has an unusually high deviation,it can be dropped from the profile, and the resulting partial profilecan be used for human identity matching.

The most preferred embodiment is demonstrated here on data generatedusing the 10 STR locus SGMplus panel (PE BioSystems, Foster City,Calif.), and size separated and detected on an ABI/310 genetic analyzersequencing instrument. A mixture proportion of 30% sample wand 70%sample B was used. Referring to FIG. 14, a minimization search forweight w by computing dev(g(w)) gave a weighting of 29.73%, which isvery close to the true 30%. The L₂ deviation dev(g(w)) of the computedgenotype from the closest (and correct) feasible solution was 0.3199.The computed genotype for all the loci is shown in the columns PROFILE(exact) and GENO B (rounded) in the table below.

Data and results are shown in the table below, where MIXTURE is thenormalized peak quantitation data from the mixed sample, GENO A is theknown genotype of individual A, PROFILE is the numerical genotypecomputed for determining B's genotype; GENO B is the resulting integergenotype (and, in this case, the known genotype as well), and DEWS arethe square deviations of the PROFILE from GENO B. Quality assessment ofthe computed PROFILE shows uniform peaks that are consistent with acorrect genotype. Examination of the square deviation components foreach allele reveals no significant problem, and the largest within locussum of squares deviation was the small value 0.16 (for locus D21S11).

LOCUS- ALLELE MIXTURE GENO A PROFILE GENO B DEVS D3S1358-14 1.03651.0000 1.0520 1.0000 0.0027 D3S1358-15 0.9635 1.0000 0.9480 1.00000.0027 vWA-17 1.4755 0 2.0999 2.0000 0.0100 vWA-18 0.5245 2.0000 −0.09990 0.0100 D16S539-11 1.4452 0 2.0567 2.0000 0.0032 D16S539-13 0.28891.0000 −0.0120 0 0.0001 D16S539-14 0.2660 1.0000 −0.0447 0 0.0020D2S1338-16 0.3190 1.0000 0.0308 0 0.0010 D2S1338-18 0.6339 0 0.90211.0000 0.0096 D2S1338-20 0.3713 1.0000 0.1052 0 0.0111 D2S1338-21 0.67580 0.9618 1.0000 0.0015 D8S1179-9 0.7279 0 1.0359 1.0000 0.0013D8S1179-12 0.2749 1.0000 −0.0320 0 0.0010 D8S1179-13 0.6813 0 0.96961.0000 0.0009 D8S1179-14 0.3160 1.0000 0.0265 0 0.0007 D21S11-27 0.27871.0000 −0.0265 0 0.0007 D21S11-29 0.7876 0 1.1208 1.0000 0.0146D21S11-30 0.9337 1.0000 0.9057 1.0000 0.0089 D18S51-12 0.3443 1.00000.0669 0 0.0045 D18S51-13 0.6952 0 0.9894 1.0000 0.0001 D18S51-14 0.67550 0.9613 1.0000 0.0015 D18S51-17 0.2850 1.0000 −0.0176 0 0.0003 D19S433-0.6991 0 0.9949 1.0000 0.0000 12.2 D19S433-14 0.6060 2.0000 0.0161 00.0003 D19S433-15 0.6949 0 0.9890 1.0000 0.0001 THO1-6 0.3178 1.00000.0291 0 0.0008 THO1-7 1.0074 1.0000 1.0105 1.0000 0.0001 THO1-9 0.67490 0.9605 1.0000 0.0016 FGA-19 1.0580 1.0000 1.0826 1.0000 0.0068 FGA-240.2830 1.0000 −0.0203 0 0.0004 FGA-25.2 0.6589 0 0.9378 1.0000 0.0039

In an alternative preferred embodiment, the heuristic function can beextended to account for the curvature of deviations (as a function ofw), so that only local minima are considered and not boundary points.Genotype elimination constraints can be applied when there is extraknowledge, such when the mixture weight (hence proportion of sample B)is low and sample A's genotype can be excluded in certain cases. It isalso possible to provide for interactive human feedback, so that expertassessments can work together with the computing method.

When stutter peaks are a concern, use stutter deconvolution tomathematically remove the stutter artifact from the quantitative signal(M W Perlin, G Lancia, and S-K Ng, “Toward fully automated genotyping:genotyping microsatellite markers by deconvolution,” Am. J. Hum. Genet.,vol. 57, no. 5, pp. 1199-1210, 1995), incorporated by reference. Theprior forensic art uses Bayesian approaches to account for stutter (PGill, R Sparkes, and J S Buckleton, “Interpretation of simple mixtureswhen artefacts such as stutters are present—with special reference tomultiplex STRs using by the Forensic Science Service,” Forensic Sci.Int., vol. 95, pp. 213-224, 1998), incorporated by reference. However,direct stutter removal from the signal can be highly robust, since it isworking directly at the level of the stutter artifact, prior to anymixture computation.

In the case when both genotypes are unknown, use additional search.Start from the linear equations p=G×w. As necessary, form feasiblesubmatrices H(locus) of G, where each H is an K×2 matrix, representing.K alleles (rows) for 2 individuals (columns). Here, H=(g_(ijk)), wherelocus i is fixed, individual j∈{1,2}, and allele k∈{1, 2, . . . , K}.For example, the six feasible four allele (K=4) genotype pairs arerepresented by the six genotype matrices {H_(i)|i=1, 2, . . . , 6}:

$H = {\begin{bmatrix}g_{*11} & g_{*21} \\g_{*12} & g_{*22} \\g_{*13} & g_{*23} \\g_{*14} & g_{*24}\end{bmatrix} \in \left\{ {\begin{bmatrix}1 & 0 \\1 & 0 \\0 & 1 \\0 & 1\end{bmatrix},\begin{bmatrix}1 & 0 \\0 & 1 \\1 & 0 \\0 & 1\end{bmatrix},\begin{bmatrix}1 & 0 \\0 & 1 \\0 & 1 \\1 & 0\end{bmatrix},\begin{bmatrix}0 & 1 \\1 & 0 \\1 & 0 \\0 & 1\end{bmatrix},\begin{bmatrix}0 & 1 \\1 & 0 \\0 & 1 \\1 & 0\end{bmatrix},\begin{bmatrix}0 & 1 \\0 & 1 \\1 & 0 \\1 & 0\end{bmatrix}} \right\}}$

Matrix division (e.g., SVD) is performed using these H submatrices. Themixture profile data at each individual locus p (locus) is also used.Proceed as follows:

-   1. Normalize the mixture data p to sum to 2 at each locus.-   2. Find the best two element weighting vector w. On the subset of    most informative loci (typically those loci having four allele    loci):

For each locus,

-   -   For every valid genotype submatrix H at that locus,

compute w(locus,H)=p(locus)\H

normalize w(locus,H) to sum to 1

set dev(locus,H) to norm(p{locus)−H×w(locus,H)′}

Set w=w(locus,H) having the smallest dev(locus,H).

-   3. Derive the genotype H of each locus as follows:

For every valid genotype matrix H_(i) at that locus,

compute dev(H_(i))=norm{p(locus)−H_(i)×w′}

Set H(locus) to that H_(L) having the smallest dev(H_(i)).

-   4. Assess the quality of the genotype result G formed by combining    the (H(locus)) at each locus. This is preferably done by:

(a) using the matrix operation w=G\p to estimate a second mixture weightw, and comparing with the first w found in Step 2, or

(b) by examining the commuted deviations dev(H(locus)) found in Step 3.

Note that the operation “w(locus,H)=p(locus)\H” in this procedure forcomputing the best weight uses a matrix operation (i.e., matrixdivision). This procedure is preferable to finding w by minimizingnorm(p(locus)−H×w(locus,H)) over w using brute force commutation.

Herein, means or mechanism for language has been used. The presence ofmeans is pursuant to 35 U.S.C. § 112 paragraph and is subject thereto.The presence of mechanism is outside of 35 U.S.C. § 112 and is notHerein, means or mechanism for language has been used. The presence ofmeans is pursuant to 35 U.S.C. § 112 paragraph and is subject thereto.The presence of mechanism is outside of 35 U.S.C. § 112 and is notsubject thereto.

Although the invention has been described in detail in the foregoingembodiments for the purpose of illustration, it is to be understood thatsuch detail is solely for that purpose and that variations can be madetherein by those skilled in the art without departing from the spiritand scope of the invention except as it may be described by thefollowing claims.

What is claimed is:
 1. A method for analyzing a nucleic acid samplecomprised of the steps: (a) forming labeled DNA sample fragments from anucleic acid sample; (b) size separating said sample fragments with sizestandard fragments, and detecting the fragments to form a sample signaland a size standard signal; (c) transforming the sample signal into sizecoordinates using the size standard signal; (d) analyzing the nucleicacid sample in size coordinates to form data; and (e) using a computerto apply at least seven different rules to check the data for possibleartifacts, where at least one of the rules compares observed measures ofthe data against expected data behavior.
 2. A method for analyzing anucleic acid sample comprised of the steps: (a) forming labeled DNAsample fragments from a nucleic acid sample; (b) size separating anddetecting said sample fragments to form a sample signal; (c) forminglabeled DNA ladder fragments corresponding to molecular lengths; (d)size separating and detecting said ladder fragments to form a laddersignal; (e) transforming the, sample signal into an allelic ladder sizecoordinate using the ladder signal, where the lengths of the DNA samplefragments are expressed using integer values of base pairs; and (f)analyzing the nucleic acid sample signal in length coordinates.
 3. Amethod as described in claim 2 wherein after the analyzing step (f)there is the additional step of determining a length or amount of afragment in the nucleic acid sample.
 4. A method as described in claim 3wherein after the determining step there is the additional step ofidentifying an individual by DNA profiling.
 5. A system for analyzing anucleic acid sample comprising.: (a) means for forming labeled DNAsample fragments from a nucleic acid sample; (b) means for sizeseparating and detecting said sample fragments to form ,a sample signal,said separating and detecting means in communication with the samplefragments; (c) means for forming labeled DNA ladder, fragmentscorresponding to molecular lengths; (d) means for size separating anddetecting said ladder fragments to form a ladder signal, said separatingand detecting means in communication with the ladder fragments; (e)means for transforming the sample signal into an allelic ladder sizecoordinate system length coordinates using the ladder signal, where thelengths of the DNA sample fragments are expressed using integer valuesof base pairs, said transforming means in communication with thesignals; and (f) means for analyzing the nucleic acid sample signal inlength coordinates, said analyzing means in communication with thetransforming means.